COMPSCI 371 Homework 10¶

Homework Submission Workflow¶

When you submit your work, follow the instructions on the submission workflow page scrupulously for full credit.

Note on Exam-Style Problems¶

Instead of handing out sample exams in preparation for the midterm and final, homework assignments include several problems that are marked Exam Style. These problems are similar in format and content to exam problems, and are graded like the other problems. Please use these problems as examples of what may come up in an exam.

As a further exercise in preparation for exams, you may want to think of other plausible problems with a similar style. If you collect these problems in a separate notebook as the course progresses, you will have a useful sample you can use to rehearse for the exams.

Problem 0 (3 points)¶

Is this assignment properly formatted in the PDF submission, and did you tell Gradescope accurately where each answer is?

Note¶

This question requires no explicit answer from you, and no space is provided for an answer in the template file.

However, the rubric entry for this problem in Gradescope will be used to deduct points for poor formatting of the assignment (PDF is not readable, code or text is clipped, math is not typeset, ...) and for incorrect indications in Gradescope of the location of each answer.

Sorry we have to do this, but poor formatting and bad answer links in Gradescope make our grading much harder.

Part 1: Mathematics of Correlation and Convolution¶

The essence of correlation and convolution is best understood in the one-dimensional case, since multi-dimensional versions of these operators are straightforward extensions of their one-dimensional versions.

In this part we denote correlation with the symbol '$\ast_+$' and convolution with the symbol '$\ast_-$' (in LaTeX, \ast_+ and \ast_-). When doing mathematics, it is easiest to think of the inputs to a correlation or a convolution as infinite sequences.

The correlation $r = a \ast_+ b$ of two such sequences $a$ and $b$ is then simply the infinite sequence

$$ r_j = \sum_i a_i b_{j+i} \;. $$

The entries $r_j$ in the sequence $r$ are defined for all integers $j$ and the summation extends over all integers $i$.

Similarly for the convolution $c = a \ast_- b$:

$$ c_j = \sum_i a_i b_{j-i} \;. $$

Because all possible values of $r$ and $c$ are computed, this view corresponds to the full correlation or convolution we saw in class.
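As an aside, these definitions translate almost verbatim into code for finite sequences, as long as entries outside a list are treated as zero. The sketch below is purely illustrative (the helper names corr_entry and conv_entry are ours and appear nowhere else in this assignment); Part 2 asks for a more careful implementation.

def corr_entry(a, b, j):
    # r_j = sum over i of a_i * b_{j+i}; out-of-range entries of b are zero
    return sum(a[i] * b[j + i] for i in range(len(a)) if 0 <= j + i < len(b))


def conv_entry(a, b, j):
    # c_j = sum over i of a_i * b_{j-i}; out-of-range entries of b are zero
    return sum(a[i] * b[j - i] for i in range(len(a)) if 0 <= j - i < len(b))

Collecting the nonzero entries over all integers $j$ yields exactly the full correlation or convolution.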

Problem 1.1 (Exam Style)¶

Use the notation above and a change of variables to show that convolution commutes:

$$ a \ast_- b = b \ast_- a \;. $$

Hints¶

  • To avoid confusion, introduce a new variable name (say, $k$) to express your change of variable rather than reusing either $i$ or $j$. Remember that summation indices are "dummy variables," that is, their names do not matter. You can always rename them when you are done if you want.
  • This proof is a one-liner.
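If you would like a numerical sanity check before or after writing your proof, the spot check below uses numpy (this is of course no substitute for the one-line derivation the problem asks for):

import numpy as np

# spot check: full convolution commutes on a random pair of sequences
rng = np.random.default_rng(0)
a, b = rng.standard_normal(6), rng.standard_normal(4)
assert np.allclose(np.convolve(a, b, 'full'), np.convolve(b, a, 'full'))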

Problem 1.2 (Exam Style)¶

Correlation, on the other hand, does not commute. This is an important reason for preferring convolution to correlation whenever possible.

Use a change of variable to show that $b \ast_+ a$ is the sequence $a \ast_+ b$ in reverse order: $(a \ast_+ b)_j = (b \ast_+ a)_{-j}$.

Hint¶

This proof is similar to the previous one.
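As before, a quick numerical spot check (again, not a proof) can help you see the reversal property in action:

import numpy as np

# spot check: b correlated with a equals a correlated with b, reversed
rng = np.random.default_rng(0)
a, b = rng.standard_normal(6), rng.standard_normal(4)
assert np.allclose(np.correlate(a, b, 'full'),
                   np.correlate(b, a, 'full')[::-1])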

Part 2: Coding Correlation and Convolution¶

Finite sequences $a$ and $b$ can be accommodated in the mathematical notation described in Part 1 by thinking of them as padded with an infinite number of zeros on both sides. However, we can only work with finite sequences when programming. This is where we need to pay close attention to the distinction between full, valid, and same mode correlations and convolutions.

The code retrieved and imported below uses the numpy functions correlate and convolve to define a new function with header

def oracle(operation, a, b, mode):

that computes correlations or convolutions in all three modes.

In addition, the function with header

def show(operation, a, b, mode, c):

also imported below, prints a line that describes the operation and shows its output, as illustrated by the function test defined and called below.

In [1]:
from urllib.request import urlretrieve
from os import path as osp


def retrieve(file_name, semester='fall22', course='371', homework=10):
    if osp.exists(file_name):
        print('Using previously downloaded file {}'.format(file_name))
    else:
        fmt = 'https://www2.cs.duke.edu/courses/{}/compsci{}/homework/{}/{}'
        url = fmt.format(semester, course, homework, file_name)
        urlretrieve(url, file_name)
        print('Downloaded file {}'.format(file_name))
In [2]:
retrieve('oracle.py')
Using previously downloaded file oracle.py
In [3]:
from oracle import oracle, show


def test(function, a=[1, 2, 3, 4], b=[5, 6]):
    for operator in ('correlate', 'convolve'):
        for mode in ('full', 'same', 'valid'):
            for f, g in ((a, b), (b, a)):
                c = function(operator, f, g, mode)
                show(operator, f, g, mode, c)
In [4]:
test(oracle)
correlate([1, 2, 3, 4], [5, 6], full) = [6, 17, 28, 39, 20]
correlate([5, 6], [1, 2, 3, 4], full) = [20, 39, 28, 17, 6]
correlate([1, 2, 3, 4], [5, 6], same) = [6, 17, 28, 39]
correlate([5, 6], [1, 2, 3, 4], same) = [39, 28, 17, 6]
correlate([1, 2, 3, 4], [5, 6], valid) = [17, 28, 39]
correlate([5, 6], [1, 2, 3, 4], valid) = [39, 28, 17]
convolve([1, 2, 3, 4], [5, 6], full) = [5, 16, 27, 38, 24]
convolve([5, 6], [1, 2, 3, 4], full) = [5, 16, 27, 38, 24]
convolve([1, 2, 3, 4], [5, 6], same) = [5, 16, 27, 38]
convolve([5, 6], [1, 2, 3, 4], same) = [5, 16, 27, 38]
convolve([1, 2, 3, 4], [5, 6], valid) = [16, 27, 38]
convolve([5, 6], [1, 2, 3, 4], valid) = [16, 27, 38]

Note that the innermost for loop in test considers the two input sequences in either order, so you can verify that the commutativity or anti-commutativity properties you proved in the previous part also hold in the finite case.

Reimplementing the Oracle¶

You should consider the function oracle as an oracle that tells you the true value of a convolution or a correlation. The goal of this part of the assignment is to define two new functions that replicate what oracle does but without using numpy. This will give you an opportunity to understand these operations more concretely.

Programming a Valid Convolution¶

From the programming point of view, convolution is easier than correlation because of commutativity: If it is convenient to flip the order of the arguments, then it is OK to do so without changing the result. In addition, the valid mode is the easiest one to program because it involves no input padding or special treatment of the start or finish of any of the sequences involved. All you need to do is get the indices right.

Because of this, you will reimplement the oracle by first implementing the valid convolution and then defining a second function that adapts the valid convolution to the full and same modes, as well as to the three modes of correlation.

Problem 2.1¶

Define a function with header

def valid_convolution(a, b):

that returns (without printing anything) the valid convolution of the two finite sequences a and b. All sequences are represented as Python lists of numbers (integer or float, does not matter).

Your Task¶

Show your code and the result of the calls

a, b = [1, 2, 3, 4], [5, 6]
print(valid_convolution(a, b))
print(valid_convolution(b, a))

You will get no credit if your implementation imports numpy or any other package.

Working Out the Valid Convolution¶

It is probably best to work out a small example to avoid getting confused by index arithmetic. Let $\ell$ be the longer of $a$ and $b$, and $s$ the shorter.

If $|a| = 7$ and $|b| = 4$, then

$$ \begin{eqnarray*} a &=& \ell = [l_0, l_1, l_2, l_3, l_4, l_5, l_6] \\ b &=& s = [s_0, s_1, s_2, s_3] \;. \end{eqnarray*} $$

The valid convolution of $a$ and $b$ is obtained by flipping the shorter sequence $s$ (that is, reversing the order of its entries) and then sliding it along $\ell$ left to right. For all positions where the flipped $s$ and $\ell$ overlap fully, compute the product of corresponding entries, add them up, and store the sum in the proper entry of $c$. If $\sigma$ is the flipped $s$,

$$ (\sigma_0, \sigma_1, \sigma_2, \sigma_3) = (s_3, s_2, s_1, s_0) \;, $$

then we have the following set of positions and corresponding values of $c_j$, using Python-style subscripts:

$$ \begin{array}{*{8}{c}l} \ell_0 & \ell_1 & \ell_2 & \ell_3 & \ell_4 & \ell_5 & \ell_6 \\ \sigma_0 & \sigma_1 & \sigma_2 & \sigma_3 & & & & & c_0 = \sigma_0 \ell_0 + \sigma_1 \ell_1 + \sigma_2 \ell_2 + \sigma_3 \ell_3 \\ & \sigma_0 & \sigma_1 & \sigma_2 & \sigma_3 & & & & c_1 = \sigma_0 \ell_1 + \sigma_1 \ell_2 + \sigma_2 \ell_3 + \sigma_3 \ell_4 \\ & & \sigma_0 & \sigma_1 & \sigma_2 & \sigma_3 & & & c_2 = \sigma_0 \ell_2 + \sigma_1 \ell_3 + \sigma_2 \ell_4 + \sigma_3 \ell_5 \\ & & & \sigma_0 & \sigma_1 & \sigma_2 & \sigma_3 & & c_3 = \sigma_0 \ell_3 + \sigma_1 \ell_4 + \sigma_2 \ell_5 + \sigma_3 \ell_6 \\ \end{array} $$

A moment's reflection shows that this result generalizes to the following formula for the one-dimensional valid convolution:

$$ c_j = \sum_{0 \leq i < |s|} \sigma_i \ell_{i+j} \;\;\;\text{for}\;\;\; 0 \leq j < |\ell| - |s| + 1 $$

where $|x|$ denotes the length of sequence $x$.

Programming Notes¶

  • In code, it is safer to use ell rather than l to denote $\ell$, since l may be confused with the number one in some typefaces.

  • The expression for $c_j$ above is written with "Pythonic" notation for the ranges of indices. That is, $0 \leq i < |s|$ can be implemented as i in range(len(s)) and $0 \leq j < |\ell| - |s| + 1$ becomes j in range(len(ell) - len(s) + 1).

  • For a particular value of $j$, the summation that defines $c_j$ is an inner (or dot) product of the flipped s with ell[j:j+len(s)], two sequences of length $|s|$. Because of this, it may be simpler to first define a helper function dot that computes this dot product (a sketch of dot appears after these notes), and then let valid_convolution call that function in a loop over $j$.

  • List comprehensions are elegant and compact ways to implement both dot and valid_convolution, but feel free to use standard loops instead.

  • To flip s, use s[::-1]. While there are several other ways to reverse a list in Python, they all have drawbacks for this application. The most readable option is to define a helper function, which will turn out to be useful later on as well:

      def flip(s):
          return s[::-1]
  • You may want to check that the outputs from your two test calls are consistent with the output of the test(oracle) call shown earlier.
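For concreteness, here is one possible shape for the dot helper mentioned in the notes above; assembling valid_convolution out of dot, flip, and a loop over $j$ is left to you:

def dot(u, v):
    # inner product of two equal-length lists of numbers
    return sum(ui * vi for ui, vi in zip(u, v))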

Problem 2.2¶

The same mode correlation (or convolution) operator takes two finite sequences $a$ and $b$ and returns a correlation (or convolution) sequence $c$ that has the length of the longer of $a$ and $b$. Specifically, the length of $c$ is

$$ |c| = \max(|a|, |b|) \;. $$

To accomplish this, the longer sequence is padded with a total number of zeros that is one less than the length of the shorter sequence. That is, if $m = \min(|a|, |b|)$, then the total amount of padding is $p = m - 1$ zeros. This padding is split approximately in half, and the two pieces are concatenated to the two ends of the longer sequence. The same correlation (or convolution) of $a$ and $b$ is then the valid correlation (or convolution) of the two sequences after the longer one has been padded in this way.

The trouble with this description is the word "approximately." If $m$ is even, then $p$ is odd, and the padding cannot be split evenly, so there are two ways to do the split.

For instance, in our running example we have $|a| = 7$ and $|b| = 4$ so that $m = \min(7, 4) = 4$ and $p = 4 - 1 = 3$. Then, the longer sequence, which is $a$, is padded with a total of 3 zeros, to make it 10 items long. Since $p$ is odd in this case, we need to decide which of the two padding methods below numpy.correlate and numpy.convolve actually use:

$$ 0, a_0,\ldots, a_6, 0, 0 \;\;\;\;\;\text{or}\;\;\;\;\; 0, 0, a_0,\ldots, a_6, 0 $$

Let us call the first one "short left padding" and the second one "long left padding."
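In code, the two alternatives are easy to construct. The helper pad below is hypothetical (you may organize your code differently), and this sketch deliberately does not reveal which alternative numpy uses:

def pad(seq, left, right):
    # concatenate runs of zeros to the two ends of a list
    return [0] * left + list(seq) + [0] * right


a = [1, 2, 3, 4, 5, 6, 7]
short_left = pad(a, 1, 2)   # [0, 1, 2, 3, 4, 5, 6, 7, 0, 0]
long_left = pad(a, 2, 1)    # [0, 0, 1, 2, 3, 4, 5, 6, 7, 0]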

Your Task¶

Either use your valid_convolution or paper and pencil to figure out the results of the same convolution of the sequences $a = [1, 2, 3, 4, 5, 6, 7]$ and $b = [8, 9, 10, 11]$ with each of the two padding alternatives above.

Then compare your results with the printout from appropriate calls to oracle with the same mode option and state which padding style (short left or long left) numpy.convolve uses.

Notes¶

  • To print the results from numpy.convolve, use the function show imported earlier from oracle.
  • You can see how to call oracle or show either by looking at the function test defined earlier or by examining the code in oracle.py.
  • The function numpy.correlate uses the same padding style as numpy.convolve, so there is no need to check that as well.

Problem 2.3¶

Now you know the padding style for the same option for both convolution and correlation.

The full padding style pads the longer sequence $\ell$ with $|s| - 1$ zeros on each side, where $s$ is the shorter sequence, and then performs a valid convolution of the padded $\ell$ with $s$.

Write a function with header

def apply(operation, a, b, mode):

that replicates what oracle does, but without using numpy.

Show your code and the result of the call

test(apply)

Again, you will get no credit if your implementation imports numpy or any other package.

Programming Notes¶

  • If operation is 'convolve', pad the longer sequence appropriately for each mode, if needed, and then call valid_convolution.
  • If operation is 'correlate', call apply recursively with appropriate arguments. A single call here takes care of all three modes. However, if mode is 'same' and one of the two sequences needs to be both flipped and padded, you need to do the padding before you call apply recursively, and then change the mode to 'valid' in the recursive call. Otherwise, the padding will happen on the wrong side because of the flip. Work out a small example to figure this out.
  • To keep the code clean, define helper functions for any operation you do more than once. These include the following:
    • A function that returns the lengths of left and right zero padding for the same mode given the length of the shorter sequence.
    • A function that returns a padded sequence given the sequence to be padded and the lengths of the left and right zero paddings.
  • You may want to check that test(apply) gives the same outputs as the call test(oracle) done earlier.
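For reference, and reusing the hypothetical pad helper sketched under Problem 2.2, the full-mode padding described above could look as follows. How the same-mode total of $|s| - 1$ zeros is split between left and right depends on your answer to Problem 2.2, so that part is left open:

def full_pad(ell, s):
    # full mode: len(s) - 1 zeros on each side of the longer sequence
    return pad(ell, len(s) - 1, len(s) - 1)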

Part 3: Back-Propagation through a Convolution¶

To keep things simple, we stay in the one-dimensional case for this part. We consider the full convolution here. So

$$ c = a \ast_-^f b $$

is the full convolution of sequences $a$ and $b$. The subscript $-$ means convolution, as before, and the superscript $f$ means full. We use superscripts $v$ and $s$ for valid and same if needed.

We assume $a$ and $b$ to be finite with lengths $|a|$ and $|b|$. The length of $c$ is therefore $|c| = |a| + |b| - 1$.

We also repurpose the symbol $\ell$ here. Consistent with usage in the class notes, $\ell$ is now the loss function used to train a neural network.

Suppose now that you are given the vector with the numerical values of the gradient of the loss function $\ell$ with respect to the output $c$ of the full convolution above for some specific data sample:

$$ \gamma = \frac{\partial\ell}{\partial c}\;. $$

This gradient is a vector in $\mathbb{R}^{|c|}$.

To back-propagate through the convolution above means to compute the gradients of $\ell$ with respect to the inputs $a$ and $b$ of the convolution, that is, the two vectors

$$ \alpha = \frac{\partial\ell}{\partial a} \in \mathbb{R}^{|a|} \;\;\;\text{and}\;\;\; \beta = \frac{\partial\ell}{\partial b} \in \mathbb{R}^{|b|} \;. $$

Problem 3.1 (Exam Style)¶

Give expressions for $\alpha$ and $\beta$ in terms of $a$, $b$, and $\gamma$. Express your results as simply as possible in terms of convolutions or correlations, specifying also the mode (full, valid, same).

Assume that $|a| \geq |b|$.

Hint¶

To obtain your expressions, start from an example similar to that in problem 2.1 above but a bit smaller, say with $|a| = 4$ and $|b| = 3$. Then use the chain rule of differentiation to write formulas for each entry of $\alpha$ and $\beta$ in terms of the individual entries of $a$, $b$, and $\gamma$. Finally, examine the pattern of the result and think how you can rewrite it in compact form.

To get you started, the homework template file contains the following equations that spell out that convolution directly in terms of the entries of $a$ and $b$.

$$ \begin{eqnarray*} c_0 &=& a_0 b_0 \\ c_1 &=& a_0 b_1 + a_1 b_0 \\ c_2 &=& a_0 b_2 + a_1 b_1 + a_2 b_0 \\ c_3 &=& a_1 b_2 + a_2 b_1 + a_3 b_0 \\ c_4 &=& a_2 b_2 + a_3 b_1 \\ c_5 &=& a_3 b_2 \end{eqnarray*} $$

Notes¶

  • There are 7 partial derivatives to compute for the example, each with several terms, so be patient.
  • Cutting and pasting LaTeX is fine, but be careful with the subscripts. This problem is all about the subscripts!
  • Not all variables show up in all equations. It is OK to omit terms that are zero when you write out the chain rule.
  • Don't try to guess the pattern until you are done. Write out all equations carefully, check for mistakes, and then look for a pattern. If you don't see the pattern from your expressions, make a drawing for yourself to see what is going on.
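Although the problem asks for a derivation, a finite-difference check is a good way to validate whatever closed form you find. The sketch below uses the sizes from the hint and treats the loss locally as $\gamma^T c$, whose gradient with respect to $c$ is exactly $\gamma$; the names ell, alpha, and beta mirror the symbols above, but the script itself is ours:

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(4)
b = rng.standard_normal(3)
gamma = rng.standard_normal(a.size + b.size - 1)   # stands in for dl/dc


def ell(a, b):
    # a loss whose gradient with respect to c = a conv b is exactly gamma
    return gamma @ np.convolve(a, b, 'full')


eps = 1e-6
alpha = [(ell(a + eps * (np.arange(a.size) == i), b) - ell(a, b)) / eps
         for i in range(a.size)]
beta = [(ell(a, b + eps * (np.arange(b.size) == i)) - ell(a, b)) / eps
        for i in range(b.size)]
print(np.round(alpha, 4), np.round(beta, 4))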

Part 4: Network Back-Propagation¶

The tiny neural network in the figure below has two layers, three neurons, and an input vector $\mathbf{x}$ with three components $x_1, x_2, x_3$. For simplicity, neurons $n_1$ and $n_2$ have no bias, while $n_3$ has bias $b$. Also, $n_1$ and $n_2$ have a ReLU nonlinearity while $n_3$ is linear (that is, it has no non-linearity). The network gains are shown near the edges.

[Figure: tiny network]

Thus, the relation $\hat{y} = h(\mathbf{x})$ between input vector $\mathbf{x} = (x_1, x_2, x_3)$ and output $\hat{y}$ from the network is as follows:

$$ \begin{eqnarray*} a_1 &=& u_{11} x_1 + u_{12} x_2 + u_{13} x_3 \\ a_2 &=& u_{21} x_1 + u_{22} x_2 + u_{23} x_3 \\ z_1 &=& \max(0, a_1) \\ z_2 &=& \max(0, a_2) \\ \hat{y} &=& v_1 z_1 + v_2 z_2 + b \end{eqnarray*} $$

The variables $a_1$ and $a_2$ in these expressions denote the activations of neurons $n_1$ and $n_2$, that is, the values of their outputs before they are passed through the ReLU $\rho(a) = \max(0, a)$. Making $a_1$ and $a_2$ explicit will simplify your work later on.

The numerical values of the weights for this network are as follows:

$$ \begin{array}{lll} u_{11} = 2 & u_{12} = -1 & u_{13} = 0 \\ u_{21} = -1 & u_{22} = 1 & u_{23} = 3 \\ v_1 = 4 & v_2 = 1 & b = -4 \end{array} $$

We collect all the network weights into a vector $\mathbf{w}$ in the following order:

$$ \mathbf{w} = (u_{11}, u_{12}, u_{13}, u_{21}, u_{22}, u_{23}, v_1, v_2, b) \in\mathbb{R}^9 \;. $$

Given an input vector $\mathbf{x}$ with true output value $y$, the quadratic loss is used to measure estimation errors. This loss is defined as follows: $$ \ell(y, \mathbf{x}) = \frac{1}{2} [h(\mathbf{x}) - y]^2 \;. $$

The division by 2 is introduced for mathematical convenience, since it cancels a 2 that comes from differentiating the square when computing gradients.

Problem 4.1 (Exam Style)¶

To run a forward pass through a network means to compute $\hat{y} = h(\mathbf{x})$ for a specific input $\mathbf{x}$.

Run a forward pass through the network in the figure for input $$ \mathbf{x} = (-3, -2, 2) $$ giving, in this order, the numerical values of $a_1$, $a_2$, $z_1$, $z_2$, and $\hat{y}$. You will need these values in the next problem.

Finally, give the value of the quadratic loss if the true value for this input is $y = 5$.

Notes¶

  • There is no need to show your calculations if you are sure of your answers, since the calculations merely follow the equations given above.
  • However, be careful in your calculations, since mistakes will propagate to the next problem and you will be penalized twice for them.
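If you want to double-check your arithmetic, the following direct transcription of the forward-pass equations, with the weights and input given above, reproduces the requested values. It is only a check, not part of the required answer:

u11, u12, u13 = 2, -1, 0
u21, u22, u23 = -1, 1, 3
v1, v2, b = 4, 1, -4
x1, x2, x3 = -3, -2, 2

a1 = u11 * x1 + u12 * x2 + u13 * x3
a2 = u21 * x1 + u22 * x2 + u23 * x3
z1, z2 = max(0, a1), max(0, a2)
y_hat = v1 * z1 + v2 * z2 + b
loss = 0.5 * (y_hat - 5) ** 2    # true value y = 5
print(a1, a2, z1, z2, y_hat, loss)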

Problem 4.2 (Exam Style)¶

Back-propagating through the network for a particular input $\mathbf{x}$ and set of values for the network weights in $\mathbf{w}$ means computing the gradient

$$ \mathbf{g} = \frac{\partial\ell}{\partial\mathbf{w}} \in \mathbb{R}^9 $$

for that input and those weights.

The calculation is done as described in the class notes.

In this specific instance, the order of calculations is as follows, where $g_\alpha$ is shorthand for $\frac{\partial\ell}{\partial\alpha}$:

$$ g_{\hat{y}}, g_b, g_{v_2}, g_{v_1}, g_{z_2}, g_{z_1}, g_{a_2}, g_{a_1}, g_{u_{23}}, g_{u_{22}}, g_{u_{21}}, g_{u_{13}}, g_{u_{12}}, g_{u_{11}} $$

Back-propagate through the given network for the numerical input and true label values given above. Once you are done, also give the vector $g_{\mathbf{w}} = \frac{\partial\ell}{\partial\mathbf{w}}$ in numerical form, where the entries are now ordered as they are in the vector $\mathbf{w}$ defined above.

This time, do show your calculations, using the format given in the homework template file. The first few lines in the template are done for you, so you see what format we are looking for.

You may use the expression $\sigma(a)$ to denote a sub-derivative of the ReLU function $\rho(a)$. One possible definition of this sub-derivative is $\sigma(a) = 1$ for $a > 0$ and $\sigma(a) = 0$ otherwise.
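As with Problem 3.1, finite differences offer a sanity check on the final gradient vector (not on the chain-rule calculations the problem asks you to show). The sketch below perturbs each weight in turn; the helper loss is ours:

def loss(w, x=(-3, -2, 2), y=5):
    # quadratic loss of the network for weight vector w, ordered as above
    u11, u12, u13, u21, u22, u23, v1, v2, b = w
    a1 = u11 * x[0] + u12 * x[1] + u13 * x[2]
    a2 = u21 * x[0] + u22 * x[1] + u23 * x[2]
    y_hat = v1 * max(0, a1) + v2 * max(0, a2) + b
    return 0.5 * (y_hat - y) ** 2


w = [2, -1, 0, -1, 1, 3, 4, 1, -4]
eps = 1e-6
g = [(loss(w[:i] + [w[i] + eps] + w[i + 1:]) - loss(w)) / eps
     for i in range(len(w))]
print([round(gi, 4) for gi in g])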

Part 5: MNIST Digit Classification¶

Recognizing images from the MNIST database of handwritten digits has been said to be the Hello, World of classification problems. As classification problems go, MNIST digit recognition is relatively easy, in that it is not hard to reach test accuracies in the mid-nineties with a variety of classifiers.

The code below retrieves a version of this database that splits the samples into 10,500 training samples and 59,500 test samples. This starves training of data somewhat, making the problem a bit more challenging.

In [5]:
import pickle


file_name = 'mnist.pickle'
retrieve(file_name)
with open(file_name, 'rb') as file:
    mnist = pickle.load(file)
Using previously downloaded file mnist.pickle

The function experiment below uses a Multi-Layer Perceptron (MLP) classifier to recognize MNIST digits with this data split. An MLP classifier is a neural network that has only fully-connected layers. The version used below has a ReLU at the output of each hidden layer.

In [6]:
def print_result(h, d):
    accuracy = {
        'train': h.score(d['train']['x'], d['train']['y']) * 100,
        'test': h.score(d['test']['x'], d['test']['y']) * 100
    }
    print('training accuracy: {:.2f} percent'.format(accuracy['train']))
    print('test accuracy: {:.2f} percent'.format(accuracy['test']))
    max_points = 20
    p = (accuracy['test'] - 90.) / (96. - 90.) * max_points
    p = min((max_points, max((0, p))))
    p = round(p)
    print('{} out of {} points'.format(p, max_points))
In [7]:
from sklearn.neural_network import MLPClassifier
from sklearn.exceptions import ConvergenceWarning
import warnings


def experiment(data, hidden_layer_sizes, max_iter, alpha,
               learning_rate_init, verbose=False):
    mlp = MLPClassifier(
        hidden_layer_sizes=hidden_layer_sizes,
        max_iter=max_iter,
        alpha=alpha,
        learning_rate_init=learning_rate_init,
        learning_rate='constant',
        solver='sgd',
        random_state=1,
        verbose=verbose
    )
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore", category=ConvergenceWarning, module="sklearn")
        mlp.fit(data['train']['x'], data['train']['y'])
    print_result(mlp, data)

Problem 5.1 (Exam Style)¶

The label set for the MNIST recognition problem is $Y = \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$.

Suppose that you run a blind classifier which, given an input $\mathbf{x}$, ignores $\mathbf{x}$ and outputs a prediction $\hat{y}$ drawn uniformly at random out of $Y$. If the MNIST data is balanced, that is, if each label out of $Y$ is equally likely to occur, what is the expected accuracy of the blind classifier, as a percentage?

No need to explain.

Problem 5.2¶

Let us run a first experiment on the MNIST data:

In [8]:
experiment(
    mnist,
    hidden_layer_sizes=(20, 5),
    max_iter=20,
    alpha=1e-8,
    learning_rate_init=0.5,
    verbose=False
)
training accuracy: 10.56 percent
test accuracy: 10.39 percent
0 out of 20 points

We are not doing too well. The last line gives the number of points you would get for this problem (out of a maximum of 20 points) for this performance. The calculation in print_result is set up so that there is no credit for scores below 90 percent test accuracy and no additional score beyond 96 percent.

Here is a brief explanation of the parameters of experiment, which are passed to the initialization of the MLPClassifier instance, except for mnist, which contains the data:

  • hidden_layer_sizes is a tuple that contains the number of output scalars for each hidden layer. Thus, our first MLP has two hidden layers with 20 and 5 output values. After that, the MLP has a soft-max layer with 10 outputs, one per digit. The cross-entropy of these outputs is used as the loss function to train the network.
  • max_iter is the maximum number of training iterations. Training stops earlier if convergence is achieved. Note the catch_warnings context manager in experiment: If training fails to converge within max_iter iterations, a warning is issued. The context manager suppresses those warnings to unclutter your output.
  • alpha is a regularization coefficient. With $N$ training samples, a term $\alpha \|\mathbf{w}\|^2 / N$ is added to the risk to lower the ability of the MLP to overfit. In this expression, $\mathbf{w}$ is a vector that collects all the weights in the network.
  • learning_rate_init is the initial learning rate, which is held constant throughout training in experiment.
  • If verbose is equal to True, a message is output at every iteration of training. You may want to set verbose to True as you try things out, so you can see what is going on. However, set verbose to False in your final run, to prevent long printouts in your PDF file.

Your Task¶

Try experiment with other values for the parameters to get a grade higher than 0/20 for this problem.

Only show your final run, which is self-grading. Do not explain your reasoning or show your other attempts.

Important: You may not change anything other than the values passed to experiment. For instance, you may not change the random_state or the solver specified to MLPClassifier, or any other part of experiment or print_result, or anything in data. Doing so would be cheating.

Notes¶

  • MNIST is a very peculiar database. It is small enough that we can run experiments rather quickly. However, other than this, any behavior you obtain on this dataset will usually not extend to others. The point of this exercise is to show you the type of tweaking that often occurs even in experiments described in the scientific literature.
  • Of course, there are many combinations of the parameters that yield similar results, so it would be suspicious if your settings were equal to those of some of your peers.