When you submit your work, follow the instructions on the submission workflow page scrupulously for full credit.
Instead of handing out sample exams in preparation for the midterm and final, the homework assignments include several problems marked Exam Style. These problems are similar in format and content to exam problems, and they are graded like the other problems. Please use them as examples of what may come up in an exam.
As a further exercise in preparation for exams, you may want to think of other plausible problems with a similar style. If you collect these problems in a separate notebook as the course progresses, you will have a useful sample you can use to rehearse for the exams.
Is this assignment properly formatted in the PDF submission, and did you tell Gradescope accurately where each answer is?
This question requires no explicit answer from you, and no space is provided for an answer in the template file.
However, the rubric entry for this problem in Gradescope will be used to deduct points for poor formatting of the assignment (PDF is not readable, code or text is clipped, math is not typeset, ...) and for inaccurate indications in Gradescope of where each answer is located.
Sorry we have to do this, but poor formatting and bad answer links in Gradescope make our grading much harder.
The essence of correlation and convolution is best understood in the one-dimensional case, since multi-dimensional versions of these operators are straightforward extensions of their one-dimensional versions.
In this part we denote correlation with the symbol '$\ast_+$' and convolution with the symbol '$\ast_-$' (in LaTeX, \ast_+ and \ast_-). When doing mathematics, it is easiest to think of the inputs to a correlation or a convolution as infinite sequences.
The correlation $r = a \ast_+ b$ of two such sequences $a$ and $b$ is then simply the infinite sequence
$$ r_j = \sum_i a_i b_{j+i} \;. $$
The entries $r_j$ in the sequence $r$ are defined for all integers $j$, and the summation extends over all integers $i$.
Similarly for the convolution $c = a \ast_- b$:
$$ c_j = \sum_i a_i b_{j-i} \;. $$
Because all possible values of $r$ and $c$ are computed, this view corresponds to the full correlation or convolution we saw in class.
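As a quick illustration of these formulas, here is a small sketch (for intuition only, not part of the assignment code) that evaluates $c_j = \sum_i a_i b_{j-i}$ for two short lists, treating entries outside each list as zero, so that only finitely many outputs can be nonzero. The correlation formula can be evaluated analogously, with $j$ allowed to be negative.

def infinite_convolve(a, b):
    # Evaluate c_j = sum_i a_i * b_{j-i}, treating a and b as zero-padded
    # infinite sequences; only j = 0, ..., len(a) + len(b) - 2 can be nonzero.
    return [sum(a[i] * b[j - i]
                for i in range(len(a))
                if 0 <= j - i < len(b))
            for j in range(len(a) + len(b) - 1)]

print(infinite_convolve([1, 2, 3, 4], [5, 6]))   # [5, 16, 27, 38, 24]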
Use the notation above and a change of variables to show that convolution commutes:
$$ a \ast_- b = b \ast_- a \;. $$
Correlation, on the other hand, does not commute. This is an important reason for preferring convolution to correlation whenever possible.
Use a change of variable to show that $b \ast_+ a$ is the sequence $a \ast_+ b$ in reverse order: $(a \ast_+ b)_j = (b \ast_+ a)_{-j}$.
This proof is similar to the previous one.
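For reference, the reindexing device that both proofs rely on is a substitution of the summation variable: for any doubly indexed quantity $f$, setting $k = j - i$ gives
$$ \sum_i f(i, j-i) = \sum_k f(j-k, k) \;, $$
since $k$ ranges over all integers exactly when $i$ does. This is only a sketch of the technique; the problems ask you to apply a substitution of this kind to the specific summands defined above.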
Finite sequences $a$ and $b$ can be accommodated into the mathematical notation described in Part 1 by thinking of them as being padded with an infinite number of zeros on both sides. However, we can only work with finite sequences when programming. This is when we need to pay close attention to the distinction between full, valid, and same mode correlations and convolutions.
The code retrieved and imported below uses the numpy functions correlate and convolve to define a new function with header
def oracle(operation, a, b, mode):
that computes correlations or convolutions in all three modes. In addition, the function with header
def show(operation, a, b, mode, c):
also imported below, prints a line that describes the operation and shows its output, as illustrated by the function test defined and called below.
from urllib.request import urlretrieve
from os import path as osp
def retrieve(file_name, semester='fall22', course='371', homework=10):
    # Download the given file from the course web site, unless it is already here
    if osp.exists(file_name):
        print('Using previously downloaded file {}'.format(file_name))
    else:
        fmt = 'https://www2.cs.duke.edu/courses/{}/compsci{}/homework/{}/{}'
        url = fmt.format(semester, course, homework, file_name)
        urlretrieve(url, file_name)
        print('Downloaded file {}'.format(file_name))
retrieve('oracle.py')
Using previously downloaded file oracle.py
from oracle import oracle, show
def test(function, a=[1, 2, 3, 4], b=[5, 6]):
    # Run the given function on both operations, all three modes,
    # and both argument orders, printing each result
    for operator in ('correlate', 'convolve'):
        for mode in ('full', 'same', 'valid'):
            for f, g in ((a, b), (b, a)):
                c = function(operator, f, g, mode)
                show(operator, f, g, mode, c)
test(oracle)
correlate([1, 2, 3, 4], [5, 6], full) = [6, 17, 28, 39, 20]
correlate([5, 6], [1, 2, 3, 4], full) = [20, 39, 28, 17, 6]
correlate([1, 2, 3, 4], [5, 6], same) = [6, 17, 28, 39]
correlate([5, 6], [1, 2, 3, 4], same) = [39, 28, 17, 6]
correlate([1, 2, 3, 4], [5, 6], valid) = [17, 28, 39]
correlate([5, 6], [1, 2, 3, 4], valid) = [39, 28, 17]
convolve([1, 2, 3, 4], [5, 6], full) = [5, 16, 27, 38, 24]
convolve([5, 6], [1, 2, 3, 4], full) = [5, 16, 27, 38, 24]
convolve([1, 2, 3, 4], [5, 6], same) = [5, 16, 27, 38]
convolve([5, 6], [1, 2, 3, 4], same) = [5, 16, 27, 38]
convolve([1, 2, 3, 4], [5, 6], valid) = [16, 27, 38]
convolve([5, 6], [1, 2, 3, 4], valid) = [16, 27, 38]
Note that the innermost for loop in test considers the two input sequences in either order, so you can verify that the commutativity or anti-commutativity properties you proved in the previous part also hold in the finite case.
You should consider the function oracle as an oracle that tells you the true value of a convolution or a correlation. The goal of this part of the assignment is to define two new functions that replicate what oracle does, but without using numpy. This will give you an opportunity to understand these operations more concretely.
Valid Convolution

From the programming point of view, convolution is easier than correlation because of commutativity: if it is convenient to flip the order of the arguments, then it is OK to do so without changing the result. In addition, the valid mode is the easiest one to program because it involves no input padding or special treatment of the start or finish of any of the sequences involved. All you need to do is get the indices right.
Because of this, you will reimplement the oracle by first implementing the valid convolution and then defining a second function that adapts the valid convolution to the full and same modes, and then to the three modes of correlation.
Define a function with header
def valid_convolution(a, b):
that returns (without printing anything) the valid convolution of the two finite sequences a and b. All sequences are represented as Python lists of numbers (integer or float, it does not matter which).
Show your code and the result of the calls
a, b = [1, 2, 3, 4], [5, 6]
print(valid_convolution(a, b))
print(valid_convolution(b, a))
You will get no credit if your implementation imports numpy or any other package.
Valid Convolution

It is probably best to work out a small example to avoid getting confused with index arithmetic. Let $\ell$ be the longer of $a, b$ and $s$ be the shorter.
If $|a| = 7$ and $|b| = 4$, then
$$ \begin{eqnarray*} a &=& \ell = [l_0, l_1, l_2, l_3, l_4, l_5, l_6] \\ b &=& s = [s_0, s_1, s_2, s_3] \;. \end{eqnarray*} $$
The valid convolution of $a$ and $b$ is obtained by flipping the shorter sequence $s$ (that is, reversing the order of its entries) and then sliding it along $\ell$ left to right. For all positions where the flipped $s$ and $\ell$ overlap fully, compute the product of corresponding entries, add them up, and store the sum in the proper entry of $c$. If $\sigma$ is the flipped $s$, then we have the following set of positions and corresponding values of $c_j$, using Python-style subscripts:
$$ \begin{eqnarray*} c_0 &=& \sigma_0 l_0 + \sigma_1 l_1 + \sigma_2 l_2 + \sigma_3 l_3 \\ c_1 &=& \sigma_0 l_1 + \sigma_1 l_2 + \sigma_2 l_3 + \sigma_3 l_4 \\ c_2 &=& \sigma_0 l_2 + \sigma_1 l_3 + \sigma_2 l_4 + \sigma_3 l_5 \\ c_3 &=& \sigma_0 l_3 + \sigma_1 l_4 + \sigma_2 l_5 + \sigma_3 l_6 \end{eqnarray*} $$
A moment's reflection will show that this result can be generalized to the following general formula for the one-dimensional valid convolution:
$$ c_j = \sum_{0 \leq i < |s|} \sigma_i \, \ell_{j+i} \;\;\;\text{for}\;\;\; 0 \leq j < |\ell| - |s| + 1 \;, $$
where $\sigma$ is the flipped version of the shorter sequence $s$ and $|x|$ denotes the length of sequence $x$.
In code, it is safer to use ell rather than l to denote $\ell$, since l may be confused with the number one in some typefaces.

The expression for $c_j$ above is written with "Pythonic" notation for the ranges of indices. That is, $0 \leq i < |s|$ can be implemented as i in range(len(s)) and $0 \leq j < |\ell| - |s| + 1$ becomes j in range(len(ell) - len(s) + 1).
For a particular value of $j$, the summation that defines $c_j$ is an inner (or dot) product of the flipped sequence $\sigma$ with ell[j:j+len(s)], which are two sequences of length $|s|$. Because of this, it may be simpler to first define a helper function dot that computes this dot product, and then let valid_convolution call that function in a loop over $j$.

List comprehensions are elegant and compact ways to implement both dot and valid_convolution, but feel free to use standard loops instead.
To flip s, use s[::-1]. While there are several other ways to reverse a list in Python, they all have their drawbacks for this application. The most readable option is to define a helper function, which will turn out to be useful later on as well:

def flip(s):
    return s[::-1]
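Putting these hints together, one possible arrangement is sketched below, using the flip helper just defined; this is only one way to organize the pieces, and your own implementation may differ.

def dot(u, v):
    # Inner product of two equal-length lists
    return sum(ui * vi for ui, vi in zip(u, v))

def valid_convolution(a, b):
    # Sketch: identify the longer and shorter sequence, flip the shorter one,
    # and slide it along the longer one, taking a dot product at each position
    ell, s = (a, b) if len(a) >= len(b) else (b, a)
    sigma = flip(s)
    return [dot(sigma, ell[j:j + len(s)])
            for j in range(len(ell) - len(s) + 1)]

print(valid_convolution([1, 2, 3, 4], [5, 6]))   # [16, 27, 38]
print(valid_convolution([5, 6], [1, 2, 3, 4]))   # [16, 27, 38]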
You may want to check that the outputs from your two test calls are consistent with the output of the test(oracle) call shown earlier.
The same mode correlation (or convolution) operator takes two finite sequences $a$ and $b$ and returns a correlation (or convolution) sequence $c$ that has the length of the longer of $a$ and $b$. Specifically, the length of $c$ is $|c| = \max(|a|, |b|)$.

To accomplish this, the longer sequence is padded with a total number of zeros that is one less than the length of the shorter sequence. That is, if $m = \min(|a|, |b|)$, then the total amount of padding is $p = m - 1$ zeros. This padding is split approximately in half, and the two pieces are concatenated to the two ends of the longer sequence. The same correlation (or convolution) of $a$ and $b$ is then the valid correlation (or convolution) of the two sequences after the longer one has been padded in this way.
The trouble with this description is the word "approximately." If $m$ is even, then $p$ is odd, and the padding cannot be split evenly, so there are two ways to do the split.

For instance, in our running example we have $|a| = 7$ and $|b| = 4$, so that $m = \min(7, 4) = 4$ and $p = 4 - 1 = 3$. Then, the longer sequence, which is $a$, is padded with a total of 3 zeros, to make it 10 items long. Since $p$ is odd in this case, we need to decide which of the two possible padding methods numpy.correlate and numpy.convolve actually use: either one zero is added to the left of $a$ and two zeros to its right, or two zeros are added to the left and one to the right. Let us call the first one "short left padding" and the second one "long left padding."
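To make the two alternatives concrete, here is a small sketch; the helper name pad_same and its long_left flag are purely illustrative and not part of the assignment interface.

def pad_same(ell, shorter_len, long_left=False):
    # Total padding is one less than the length of the shorter sequence,
    # split across the two ends of the longer sequence ell
    p = shorter_len - 1
    left = (p + 1) // 2 if long_left else p // 2
    return [0] * left + ell + [0] * (p - left)

a = [1, 2, 3, 4, 5, 6, 7]
print(pad_same(a, 4))                   # short left padding: [0, 1, 2, 3, 4, 5, 6, 7, 0, 0]
print(pad_same(a, 4, long_left=True))   # long left padding:  [0, 0, 1, 2, 3, 4, 5, 6, 7, 0]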
Either use your valid_convolution or paper and pencil to figure out the results of the same convolution of the sequences $a = [1, 2, 3, 4, 5, 6, 7]$ and $b = [8, 9, 10, 11]$ with each of the two padding alternatives above. Then compare your results with the printout from appropriate calls to oracle with the same mode option and state which padding style (short left or long left) numpy.convolve uses.
To display the results computed by numpy.convolve, use the function show imported earlier from oracle. You can see how to call oracle or show either by looking at the function test defined earlier or by examining the code in oracle.py. You may assume that numpy.correlate uses the same padding style as numpy.convolve, so there is no need to check that as well.

Now you know the padding style for the same option for both convolution and correlation.
The full padding style pads the longer sequence $\ell$ with $|s| - 1$ zeros on each side, where $s$ is the shorter sequence, and then performs a valid convolution of the padded $\ell$ with $s$.
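As a sketch of how the full mode reduces to the valid one (this assumes the valid_convolution function from the previous part, and shows one possible arrangement rather than the required one):

def full_convolution(a, b):
    # Pad the longer sequence with len(shorter) - 1 zeros on each side,
    # then take the valid convolution of the padded sequence with the shorter one
    ell, s = (a, b) if len(a) >= len(b) else (b, a)
    pad = [0] * (len(s) - 1)
    return valid_convolution(pad + ell + pad, s)

print(full_convolution([1, 2, 3, 4], [5, 6]))   # [5, 16, 27, 38, 24]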
Write a function with header
def apply(operation, a, b, mode):
that replicates what oracle does, but without using numpy.
Show your code and the result of the call
test(apply)
Again, you will get no credit if your implementation imports numpy or any other package.
If operation is 'convolve', pad the longer sequence appropriately for each mode, if needed, and then call valid_convolution.

If operation is 'correlate', call apply recursively with appropriate arguments. A single call here takes care of all three modes. However, if mode is 'same' and one of the two sequences needs to be both flipped and padded, you need to do the padding before you call apply recursively, and then change the mode to 'valid' in the recursive call. Otherwise, the padding will happen on the wrong side because of the flip. Work out a small example to figure this out.

It may help to write a small helper that computes the padding amounts for the same mode given the length of the shorter sequence.

Check that test(apply) gives the same outputs as the call test(oracle) done earlier.

To keep things simple, we stay in the one-dimensional case for this part. We consider the full convolution here. So
$$ c = a \ast^f_- b $$
is the full convolution of sequences $a$ and $b$. The subscript $-$ means convolution, as before, and the superscript $f$ means full. We use superscripts $v$ and $s$ for valid and same if needed.
We assume $a$ and $b$ to be finite with lengths $|a|$ and $|b|$. The length of $c$ is therefore $|c| = |a| + |b| - 1$.
We also repurpose the symbol $\ell$ here. Consistently with usage in the class notes, $\ell$ now is the loss function used to train a neural network.
Suppose now that you are given the vector with the numerical values of the gradient of the loss function $\ell$ with respect to the output $c$ of the full convolution above for some specific data sample:
$$ \gamma = \frac{\partial\ell}{\partial c} \;. $$
This gradient is a vector in $\mathbb{R}^{|c|}$.
To back-propagate through the convolution above means to compute the gradients of $\ell$ with respect to the inputs $a$ and $b$ of the convolution, that is, the two vectors
$$ \alpha = \frac{\partial\ell}{\partial a} \in \mathbb{R}^{|a|} \;\;\;\text{and}\;\;\; \beta = \frac{\partial\ell}{\partial b} \in \mathbb{R}^{|b|} \;. $$
Give expressions for $\alpha$ and $\beta$ in terms of $a$, $b$, and $\gamma$. Express your results as simply as possible in terms of convolutions or correlations, specifying also the mode (full, valid, same).
Assume that $|a| \geq |b|$.
To obtain your expressions, start from an example similar to that in problem 2.1 above but a bit smaller, say with $|a| = 4$ and $|b| = 3$. Then use the chain rule of differentiation to write formulas for each entry of $\alpha$ and $\beta$ in terms of the individual entries of $a$, $b$, and $\gamma$. Finally, examine the pattern of the result and think how you can rewrite it in compact form.
To get you started, the homework template file contains the following equations, which spell out that convolution directly in terms of the entries of $a$ and $b$:
$$ \begin{eqnarray*} c_0 &=& a_0 b_0 \\ c_1 &=& a_0 b_1 + a_1 b_0 \\ c_2 &=& a_0 b_2 + a_1 b_1 + a_2 b_0 \\ c_3 &=& a_1 b_2 + a_2 b_1 + a_3 b_0 \\ c_4 &=& a_2 b_2 + a_3 b_1 \\ c_5 &=& a_3 b_2 \end{eqnarray*} $$
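As an illustration of the chain-rule step (a sketch of the first entry only, not the full answer), differentiating the equations above with respect to $a_0$ gives
$$ \alpha_0 = \frac{\partial\ell}{\partial a_0} = \sum_j \gamma_j \frac{\partial c_j}{\partial a_0} = \gamma_0 b_0 + \gamma_1 b_1 + \gamma_2 b_2 \;, $$
since $a_0$ appears only in $c_0$, $c_1$, and $c_2$. The remaining entries of $\alpha$, and the entries of $\beta$, follow from the same kind of calculation.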
The tiny neural network in the figure below has two layers, three neurons, and an input vector $\mathbf{x}$ with three components $x_1, x_2, x_3$. For simplicity, neurons $n_1$ and $n_2$ have no bias, while $n_3$ has bias $b$. Also, $n_1$ and $n_2$ have a ReLU nonlinearity while $n_3$ is linear (that is, it has no non-linearity). The network gains are shown near the edges.
Thus, the relation $\hat{y} = h(\mathbf{x})$ between input vector $\mathbf{x} = (x_1, x_2, x_3)$ and output $\hat{y}$ from the network is as follows:
$$ \begin{eqnarray*} a_1 &=& u_{11} x_1 + u_{12} x_2 + u_{13} x_3 \\ a_2 &=& u_{21} x_1 + u_{22} x_2 + u_{23} x_3 \\ z_1 &=& \max(0, a_1) \\ z_2 &=& \max(0, a_2) \\ \hat{y} &=& v_1 z_1 + v_2 z_2 + b \end{eqnarray*} $$
The variables $a_1$ and $a_2$ in these expressions denote the activations of neurons $n_1$ and $n_2$, that is, the values of their outputs before they are passed through the ReLU $\rho(a) = \max(0, a)$. Making $a_1$ and $a_2$ explicit will simplify your work later on.
The numerical values of the weights for this network are as follows:
$$ \begin{array}{lll} u_{11} = 2 & u_{12} = -1 & u_{13} = 0 \\ u_{21} = -1 & u_{22} = 1 & u_{23} = 3 \\ v_1 = 4 & v_2 = 1 & b = -4 \end{array} $$
We collect all the network weights into a vector $\mathbf{w}$ in the following order:
$$ \mathbf{w} = (u_{11}, u_{12}, u_{13}, u_{21}, u_{22}, u_{23}, v_1, v_2, b) \in\mathbb{R}^9 \;. $$
Given an input vector $\mathbf{x}$ with true output value $y$, the quadratic loss is used to measure estimation errors. This loss is defined as follows:
$$ \ell(y, \mathbf{x}) = \frac{1}{2} [h(\mathbf{x}) - y]^2 \;. $$
The division by 2 is introduced for mathematical convenience, since it cancels a 2 that comes from differentiating the square when computing gradients.
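For concreteness, the network equations and the weight values above transcribe directly into the following Python sketch, which you may find handy for checking your arithmetic; it is not required for the assignment.

# The given weights
u11, u12, u13 = 2, -1, 0
u21, u22, u23 = -1, 1, 3
v1, v2, b = 4, 1, -4

def h(x1, x2, x3):
    # Forward pass: two ReLU neurons followed by a linear neuron with bias b
    a1 = u11 * x1 + u12 * x2 + u13 * x3
    a2 = u21 * x1 + u22 * x2 + u23 * x3
    z1, z2 = max(0, a1), max(0, a2)
    return v1 * z1 + v2 * z2 + b

def quadratic_loss(y, x1, x2, x3):
    return 0.5 * (h(x1, x2, x3) - y) ** 2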
To run a forward pass through a network means to compute $\hat{y} = h(\mathbf{x})$ for a specific input $\mathbf{x}$.
Run a forward pass through the network in the figure for input $$ \mathbf{x} = (-3, -2, 2) $$ giving, in this order, the numerical values of $a_1$, $a_2$, $z_1$, $z_2$, and $\hat{y}$. You will need these values in the next problem.
Finally, give the value of the quadratic loss if the true value for this input is $y = 5$.
Back-propagating through the network for a particular input $\mathbf{x}$ and set of values for the network weights in $\mathbf{w}$ means computing the gradient
$$ \mathbf{g} = \frac{\partial\ell}{\partial\mathbf{w}} \in \mathbb{R}^9 $$
for that input and those weights.
The calculation is done as described in the class notes.
In the specific instance, the order of calculations is as follows, where $g_\alpha$ is shorthand for $\frac{\partial\ell}{\partial\alpha}$:
$$ g_{\hat{y}}, g_b, g_{v_2}, g_{v_1}, g_{z_2}, g_{z_1}, g_{a_2}, g_{a_1}, g_{u_{23}}, g_{u_{22}}, g_{u_{21}}, g_{u_{13}}, g_{u_{12}}, g_{u_{11}} $$
Back-propagate through the given network for the numerical input and true label values given above. Once you are done, also give the vector $g_{\mathbf{w}} = \frac{\partial\ell}{\partial\mathbf{w}}$ in numerical form, where the entries are now ordered as they are in the vector $\mathbf{w}$ defined above.
This time, do show your calculations, using the format given in the homework template file. The first few lines in the template are done for you, so you see what format we are looking for.
You may use the expression $\sigma(a)$ to denote a sub-derivative of the ReLU function $\rho(a)$. One possible definition of this sub-derivative is $\sigma(a) = 1$ for $a > 0$ and $\sigma(a) = 0$ otherwise.
Recognizing images from the MNIST database of handwritten digits has been said to be the Hello, World of classification problems. As classification problems go, MNIST digit recognition is relatively easy, in that it is not hard to reach test accuracies in the mid nineties with a variety of classifiers.
The code below retrieves a version of this database that splits the samples into 10,500 training samples and 59,500 test samples. This starves training of data somewhat, making the problem a bit more challenging.
import pickle
file_name = 'mnist.pickle'
retrieve(file_name)
with open(file_name, 'rb') as file:
    mnist = pickle.load(file)
Using previously downloaded file mnist.pickle
The function experiment below uses a Multi-Layer Perceptron (MLP) classifier to recognize MNIST digits with this data split. An MLP classifier is a neural network that has only fully-connected layers. The version used below has a ReLU at the output of each hidden layer.
def print_result(h, d):
    # Report training and test accuracy of classifier h on data d,
    # and the number of points this test accuracy would earn
    accuracy = {
        'train': h.score(d['train']['x'], d['train']['y']) * 100,
        'test': h.score(d['test']['x'], d['test']['y']) * 100
    }
    print('training accuracy: {:.2f} percent'.format(accuracy['train']))
    print('test accuracy: {:.2f} percent'.format(accuracy['test']))
    max_points = 20
    p = (accuracy['test'] - 90.) / (96. - 90.) * max_points
    p = min((max_points, max((0, p))))
    p = round(p)
    print('{} out of {} points'.format(p, max_points))
from sklearn.neural_network import MLPClassifier
from sklearn.exceptions import ConvergenceWarning
import warnings
def experiment(data, hidden_layer_sizes, max_iter, alpha,
               learning_rate_init, verbose=False):
    mlp = MLPClassifier(
        hidden_layer_sizes=hidden_layer_sizes,
        max_iter=max_iter,
        alpha=alpha,
        learning_rate_init=learning_rate_init,
        learning_rate='constant',
        solver='sgd',
        random_state=1,
        verbose=verbose
    )
    with warnings.catch_warnings():
        # Suppress the warning issued when training fails to converge within max_iter iterations
        warnings.filterwarnings(
            "ignore", category=ConvergenceWarning, module="sklearn")
        mlp.fit(data['train']['x'], data['train']['y'])
    print_result(mlp, data)
The label set for the MNIST recognition problem is $Y = \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$.
Suppose that you run a blind classifier which, given an input $\mathbf{x}$, ignores $\mathbf{x}$ and outputs a prediction $\hat{y}$ drawn uniformly at random out of $Y$. If the MNIST data is balanced, that is, if each label out of $Y$ is equally likely to occur, what is the expected accuracy of the blind classifier, as a percentage?
No need to explain.
Let us run a first experiment on the MNIST data:
experiment(
mnist,
hidden_layer_sizes=(20, 5),
max_iter=20,
alpha=1e-8,
learning_rate_init=0.5,
verbose=False
)
training accuracy: 10.56 percent
test accuracy: 10.39 percent
0 out of 20 points
We are not doing too well. The last line gives the number of points you would get for this problem (out of a maximum of 20) for this performance. The calculation in print_result is set up so that there is no credit for test accuracies below 90 percent and no additional credit beyond 96 percent.
Here is a brief explanation of the parameters of experiment, which are passed to the initialization of the MLPClassifier instance, except for mnist, which contains the data:

hidden_layer_sizes is a tuple that contains the number of output scalars for each layer. Thus, our first MLP has two hidden layers with 20 and 5 output values. After that, the MLP has a soft-max layer with 10 outputs, one per digit. The cross-entropy of these outputs is used as the loss function to train the network.

max_iter is the maximum number of training iterations. Training stops earlier if convergence is achieved. Note the catch_warnings context manager in experiment: if training fails to converge within max_iter iterations, a warning is issued. The context manager suppresses those warnings to unclutter your output.

alpha is a regularization coefficient. With $N$ training samples, a term $\alpha \|\mathbf{w}\|^2 / N$ is added to the risk to lower the ability of the MLP to overfit. In this expression, $\mathbf{w}$ is a vector that collects all the weights in the network.

learning_rate_init is the initial learning rate, which is held constant throughout training in experiment.

If verbose is equal to True, a message is output at every iteration of training. You may want to set verbose to True as you try things out, so you can see what is going on. However, set verbose to False in your final run, to prevent long printouts in your PDF file.

Try experiment with other values for the parameters to get a grade higher than 0/20 for this problem.
Only show your final run, which is self-grading. Do not explain your reasoning or show your other attempts.
Important: You may not change anything other than the values passed to experiment. For instance, you may not change the random_state or the solver specified to MLPClassifier, or any other part of experiment or print_result, or anything in data. Doing so would be cheating.