Homework 2

This homework assignment explores some situations in which the geometric intuition we developed through our sensory experience in two or three dimensions does not carry over into higher-dimensional spaces (part 1). It then reviews some terminology (part 2). Finally, in part 3, it previews, in a simple context and in simplified form, some concepts related to validation, which we will explore more deeply later in the course.

None of the answers in this assignment require mathematical proof.

Homework Submission Workflow

When you submit your work, follow the instructions on the submission workflow page scrupulously for full credit.

Part 1: Intuition in Many Dimensions

This part explores some important differences between low-dimensional and high-dimensional worlds. These discrepancies should caution us not to rely too much on geometric intuition when we work in spaces with many dimensions.

Cubes, Spheres, Corners, and Skins

Let $d$ be a positive integer. Given a positive real number $a$, the set $$ C(d, a) = \{\mathbf{x}\in\mathbb{R}^d \ :\ -a/2 \leq x_i \leq a/2 \;\;\text{for}\;\; i=1,\ldots,d\} $$ is a cube of side-length $a$, centered at the origin. The unit cube is $C(d, 1)$.

The unit sphere $S(d)$ in $d$ dimensions is the set of points at Euclidean distance at most 1 from the origin: $$ S(d) = \{\mathbf{x}\in\mathbb{R}^d \ :\ \|\mathbf{x}\| \leq 1 \}\;. $$

The unit sphere $S(d)$ is exactly inscribed in the 2-cube $C(d, 2)$, as shown on the left in the figure below for $d=2$. Let us call the set $$ V(d, 2) = C(d, 2) \setminus S(d) $$ (where '$\setminus$' denotes set difference) the corners of the 2-cube. This is the part of the 2-cube that is not in the sphere inscribed in it, and is shown in gray on the left in the figure below for $d=2$.

Let $\epsilon$ be a small positive real number ($0<\epsilon \ll 1$). The $\epsilon$-skin $E(d, \epsilon)$ of the unit cube $C(d, 1)$ is a layer of thickness $\epsilon/2$ just inside the surface of $C(d, 1)$: $$ E(d, \epsilon) = C(d, 1) \setminus C(d, 1-\epsilon) \;. $$ The figure on the right below shows the $\epsilon$-skin $E(2, \epsilon)$ in gray for some small $\epsilon$.

[Figure: left, the unit sphere $S(2)$ inscribed in the 2-cube $C(2, 2)$, with the corners $V(2, 2)$ in gray; right, the $\epsilon$-skin $E(2, \epsilon)$ in gray.]

Volume

The volume of a compact set $\mathcal{A}\subset\mathbb{R}^d$ is the integral of $1$ over $\mathcal{A}$.

It can be shown (see for instance C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006) that the volume of $S(d)$ is $$s(d) = \frac{2\pi^{d/2}}{d\ \Gamma(d/2)}$$ where $\Gamma(z)$ is the gamma function.

The Python function scipy.special.gamma computes $\Gamma(z)$. In particular, the values of $\Gamma(d/2)$ for $d=1,2,3$ are

$$ \Gamma(1/2) = \sqrt{\pi}\;\;\;\text{,}\;\;\; \Gamma(1) = 1 \;\;\;\text{,}\;\;\; \Gamma(3/2) = \sqrt{\pi}/2 \;. $$
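As a check, here is a minimal sketch of this formula in code (the function name s below is ours, mirroring the notation above):

import numpy as np
from scipy.special import gamma

def s(d):
    # Volume of the unit sphere S(d), from the formula above
    return 2 * np.pi ** (d / 2) / (d * gamma(d / 2))

print(s(1), s(2), s(3))  # 2.0, pi = 3.14159..., 4*pi/3 = 4.18879...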

Problem 1.1

What is the volume $c(d, a)$ of $C(d, a)$ and, in particular, what is $c(d, 1)$? No need to prove your answers.

Problem 1.2

Give a simple formula for the volume $e(d, \epsilon)$ of $E(d, \epsilon)$ in terms of $d$ and $\epsilon$.

Problem 1.3

Use the function matplotlib.pyplot.loglog, which scales both axes logarithmically, to plot $e(d, 10^{-3})$ for $d$ between $1$ and $10^{6}$. Label the axes.

Programming Notes

  • You may want to refer to Homework 1 for how to display a plot.

  • To make matplotlib code work correctly with multiple figures in the same notebook, put the line

    %matplotlib inline

    before you do your first plot.

  • Of course, there is no need to compute values for every integer $d$ in the requested range. The function numpy.logspace with default values for the optional arguments gives a reasonable sampling; see the snippet below.
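For instance, here is a quick illustration of that default sampling (the endpoints are chosen to match the range in Problem 1.3):

import numpy as np

# 50 samples (the default) from 10**0 to 10**6, evenly spaced on a log scale
d = np.logspace(0, 6)
print(d[0], d[-1], len(d))  # 1.0 1000000.0 50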

Problem 1.4

Write a simple sentence that describes where most of the points in the unit cube are when the number $d$ of dimensions is large and $\epsilon$ is small. What is the approximate numerical value of the fraction of points in the $\epsilon$-skin of the unit cube when $d=2$ and $\epsilon=10^{-3}$?

Problem 1.5

Write a formula for the ratio $r(d)$ between the volume of the unit sphere and that of the 2-cube. Simplify your formula if possible and show your calculations.

Problem 1.6

Plot $r(d)$ (as a regular plot, no logarithms) for every integer $d$ between 1 and 10. Since only integer values of $d$ make sense, draw dots connected by a solid line (that is, specify options marker='.', markersize=12 for matplotlib.pyplot.plot).
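As a sketch of the requested plotting style only (the values in r below are placeholders, not the answer):

import numpy as np
import matplotlib.pyplot as plt

d = np.arange(1, 11)
r = np.ones(10)  # placeholder values; substitute your formula for r(d)
plt.plot(d, r, marker='.', markersize=12)
plt.xlabel('d')
plt.ylabel('r(d)')
plt.show()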

Problem 1.7

Approximately what percentage of the volume of a cube is contained in the sphere inscribed in the cube when $d=2$? And when $d=10$?

Part 2: Loss

The input $\mathbf{x}$ to all predictors mentioned in the problems that follow is a black-and-white image with $307,\negthinspace 200 = 480\times 640$ pixels arranged in $480$ rows and $640$ columns. For simplicity, the value of a pixel is a real number, without further restriction. The feature function $\phi$ defined in class (lecture on functions and data fitting) is the identity function. As a consequence, the functions $f$ and $h$ are the same, and so are the two domains $A$ and $X$.

For a classifier, the zero-one loss $\ell(y, y')$ is a loss function that assigns a value of $1$ to a classification mistake (in which the true output $y$ is different from the predicted output $y'$) and a value of $0$ to a correct prediction. This loss can be represented by a matrix that has one row per possible true output $y$ and one column per possible predicted output $y'$. The entry at row $y$ and column $y'$ of this matrix contains the value of the loss corresponding to $y$ and $y'$.
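For instance, a binary classifier with possible outputs $Y = \{0, 1\}$ (a simpler case than the one in Problem 2.1 below) has the zero-one loss matrix

$$ \begin{array}{c|cc} \ell & y'=0 & y'=1 \\ \hline y=0 & 0 & 1 \\ y=1 & 1 & 0 \end{array} $$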

Problem 2.1

The output $y$ of a classifier $h$ is the activity performed by the single person visible in the image $\mathbf{x}$. That person is playing tennis, and the possible activities are serve, forehand, backhand, volley, and other. The other activity covers any shot that is not one of the other types, as well as any other activity or lack thereof.

a. If the signature of $h$ is $X\rightarrow Y$, what are $X$ and $Y$, in terms of quantities given in the text for this part?

b. Write the zero-one loss matrix for this classifier.

Label your answers a and b.

Problem 2.2

The output $\mathbf{y}$ of a predictor $h$ (different from the one in the previous problem) is a vector of $25$ parameters that describe the tennis player's body configuration in the given image $\mathbf{x}$. You can think of these $25$ real numbers as angles that describe the orientation of each body joint (shoulder, elbow, knee,...) relative to the limb the joint is attached to (torso, arm, thigh,...).

a. Is $h$ a classifier or a regressor?

b. If the signature of $h$ is $X\rightarrow Y$, what are $X$ and $Y$, in terms of quantities given in the text for this part?

c. Why can the loss function not be represented by a matrix for this problem?

Part 3: Validation

The problems in this part explore a simplified version of concepts we will cover later in this class.

The key difference between data fitting and machine learning is that the former just needs to do well on the training data, while the latter is expected to do well also on previously unseen data. One way to check that a predictor does so is to have two data sets instead of one: the training set $T$, on which the predictor $h$ is trained, and a separate set $V$, called the validation set, disjoint from $T$. (Depending on context, the second set could also be called the test set. More on this later in the course.)

The average loss of $h$ on $T$ is called the training risk, and the average loss on $V$ is called the validation risk. Both risks are empirical, in the sense that they are estimates computed on data. The validation set plays the role of "previously unseen data" and the validation risk is a (possibly crude) estimate of how $h$ performs on "previously unseen data."
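In symbols, and anticipating the notation of Problem 3.2 below, if $\ell$ is the loss function then $$ L_T(h) = \frac{1}{|T|} \sum_{(x, y)\in T} \ell(y, h(x)) \;\;\;\text{and}\;\;\; L_V(h) = \frac{1}{|V|} \sum_{(x, y)\in V} \ell(y, h(x)) \;. $$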

We explore these concepts in the context of univariate polynomial regression, which we already saw in homework 1. Rather than writing our own fitting function, we use numpy.polyfit. Please look at its manual page online. Note in particular that this function returns an array of coefficients in order of decreasing power. The function numpy.polyval takes a polynomial from numpy.polyfit and evaluates it at a given set of points.
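As a minimal sketch of these two functions at work, on arbitrary made-up points:

import numpy as np

# Fit a parabola to three points; coefficients come back highest power first
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 0.0, 1.0])
h = np.polyfit(x, y, 2)    # array([ 1., -2.,  1.]), that is, x**2 - 2x + 1
print(np.polyval(h, 0.5))  # 0.25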

Data

If the file data.pickle that comes with this assignment is in the same directory as your notebook, the following code reads a dictionary data from that file. The dictionary contains a training set data['T'] and a validation set data['V']. Each data set is in turn a dictionary. For instance, if you define

T = data['T']

for training set

$$ T = \{(x_1, y_1),\ldots, (x_N, y_N) \} $$

then T['x'] is a numpy one-dimensional array with all inputs $x_n$ and T['y'] is a similar array with all corresponding outputs $y_n$ in $T$. The format of the validation set $V$ is the same.

In [1]:
import pickle

with open('data.pickle', 'rb') as f:
    data = pickle.load(f)

Problem 3.1

Write a function train with two arguments. The first argument T is a training set represented like data['T'] above. The second argument k is a nonnegative integer. The function train fits a univariate polynomial of degree k to the data in T and returns its coefficients in the format produced by numpy.polyfit.

Show your code and the result of fitting a polynomial of degree 3 to data['T']. Specifically, show the coefficients of the polynomial in decreasing order of degree and with three decimal digits after the period, and then use the function show below (similar to that in homework 1, but using numpy.polyval) to plot both the training set and the polynomial.

In [2]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

def show(x, y, hList = []):
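    """Plot the points (x, y) as dots and overlay each polynomial in hList."""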
    plt.plot(x, y, marker='.', markersize=12, ls='')
    npt = 100
    xrange = [min(x), max(x)]
    xFine = np.linspace(xrange[0], xrange[1], npt)
    for h in hList:
        hxFine = np.polyval(h, xFine)
        plt.plot(xFine, hxFine, label='degree ' + str(len(h) - 1))

    plt.xlabel('x')
    plt.ylabel('y')
    plt.legend()
    plt.show()

Hints:

  • The body of the function train is one line of code, or perhaps two.
  • You should not expect a great fit, as the data was not generated with a third-degree polynomial.

Problem 3.2

Write a function loss(y, yp) that takes the true value y and the predicted value yp for a regression problem and returns the quadratic loss $\ell(y, y') = (y - y')^2$. The arguments are numpy arrays of equal size, and the output is an array of the same size with all of the losses.

Then write a function L(h, T, l) that takes a numpy vector h of polynomial coefficients, a training set T represented as a dictionary with entries x and y as shown earlier, and a loss function l of the same signature as loss, and computes the empirical risk of h on T with loss l:

$$ L_T(h) = \frac{1}{N} \sum_{n=1}^N \ell(y_n, h(x_n)) $$

where $N$ is the number of samples in $T$.

Show your code and the numerical value of the empirical risk for the fit in problem 3.1. Use three decimal digits after the period.

(Hint: numpy.mean comes in handy.)

Problem 3.3

Make a figure that superimposes two plots in the same diagram, each plot represented as a sequence of dots connected by lines (as in the function show above).

The first plot is the training risk, that is, the empirical risk on data['T'] of fits like the one in problem 3.1, but for polynomial degrees $k = 0,\ldots, 12$, using the quadratic loss. The second plot is the validation risk, that is, the analogous quantity estimated on data['V'] (of course, for both plots the polynomials are trained on data['T']).

Label the axes of the figure as "k" and "risk" and place a legend that specifies which plot is which.

Show both code and plot.

Problem 3.4

For what value $k^{\star}$ of $k$ does the fit generalize best to the validation data? Also use the function show to display the validation points and the fit to the training data for $k^{\star}$.

Problem 3.5

Why is the plot of the training risk in problem 3.3 monotonically non-increasing? (If your plot doesn't look that way, you may want to review your solution to 3.3.) Explain briefly and clearly.

Problem 3.6

There are two technical terms that denote the fact that the validation risk is (i) higher for $k < k^{\star}$ and (ii) higher for $k > k^{\star}$, when compared to the validation risk at $k^{\star}$. What are the two terms? Specify which term is for which case ((i) or (ii)).