COMPSCI 371 Homework 8¶

Homework Submission Workflow¶

When you submit your work, follow the instructions on the submission workflow page scrupulously for full credit.

Note on Exam-Style Problems¶

Instead of handing out sample exams in preparation for the midterm and final, homework assignments include several problems that are marked Exam Style. These problems are similar in format and content to those in the exams, and they are graded like the other problems. Please use these problems as examples of what may come up in an exam.

As a further exercise in preparation for exams, you may want to think of other plausible problems with a similar style. If you collect these problems in a separate notebook as the course progresses, you will have a useful sample you can use to rehearse for the exams.

Problem 0 (3 points)¶

Is this assignment properly formatted in the PDF submission, and did you tell Gradescope accurately where each answer is?

Note¶

This question requires no explicit answer from you, and no space is provided for an answer in the template file.

However, the rubric entry for this problem in Gradescope will be used to deduct points for poor formatting of the assignment (PDF is not readable, code or text is clipped, math is not typeset, ...) and for incorrect indications in Gradescope of the location of each answer.

Sorry we have to do this, but poor formatting and bad answer links in Gradescope make our grading much harder.

Part 1: Kernels¶

Anything that is stated in the class notes about SVMs and kernels can be assumed to be known and requires no proof for this part of the assignment.

Problem 1.1 (Exam Style)¶

Consider the following objects:

  • A training set $T = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$ with $\mathbf{x}_i\in\mathbb{R}^d$ and $y_i\in\{-1, 1\}$.
  • A function $K(\mathbf{x}, \boldsymbol{\xi})$ from $\mathbb{R}^d\times \mathbb{R}^d$ to $\mathbb{R}$ for which the following equalities hold:
$$ K(\mathbf{x}_1, \mathbf{x}_1) = K(\mathbf{x}_2, \mathbf{x}_2) = 1 \;\;\;\text{and}\;\;\; K(\mathbf{x}_1, \mathbf{x}_2) = K(\mathbf{x}_2, \mathbf{x}_1) = 2 \;. $$

These equalities involve the first two samples in $T$.

Either construct a function $\boldsymbol{\varphi}(\mathbf{x})$ from $\mathbb{R}^d$ to $\mathbb{R}^e$ for some suitable $e$ such that

$$ K(\mathbf{x}, \boldsymbol{\xi}) = \boldsymbol{\varphi}^T(\mathbf{x}) \boldsymbol{\varphi}(\boldsymbol{\xi}) $$

is a valid SVM kernel on $T$, or explain why no such function $\boldsymbol{\varphi}$ exists.

Problem 1.2 (Exam Style)¶

Let $\mathbf{x}$ and $\boldsymbol{\xi}$ be vectors in $\mathbb{R}^3$ and consider the following function from $\mathbb{R}^3\times \mathbb{R}^3$ to $\mathbb{R}$:

$$ K(\mathbf{x}, \boldsymbol{\xi}) = (\mathbf{x}^T\boldsymbol{\xi} + c)^2 $$

where $c$ is a real number.

Show that $K$ is a valid SVM kernel by giving a function $\boldsymbol{\varphi}(\mathbf{x})$ from $\mathbb{R}^3$ to $\mathbb{R}^e$ for some suitable $e$ such that

$$ K(\mathbf{x}, \boldsymbol{\xi}) = \boldsymbol{\varphi}^T(\mathbf{x}) \boldsymbol{\varphi}(\boldsymbol{\xi}) \;. $$

Also state the value of $e$ explicitly.

Hint¶

Spell out the square in terms of the components $x_1, x_2, x_3$ of $\mathbf{x}$ and $\xi_1, \xi_2, \xi_3$ of $\boldsymbol{\xi}$ and be patient.
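
To see the pattern, it may help to work out the smallest case first. For $d = 1$ (an illustration only, and it assumes $c \ge 0$ so that $\sqrt{2c}$ is real),

$$ (x\xi + c)^2 = x^2\xi^2 + 2c\,x\xi + c^2 = \boldsymbol{\varphi}^T(x)\,\boldsymbol{\varphi}(\xi) \;\;\;\text{with}\;\;\; \boldsymbol{\varphi}(x) = \begin{pmatrix} x^2 \\ \sqrt{2c}\,x \\ c \end{pmatrix} \;, $$

so $e = 3$ when $d = 1$. The case $d = 3$ works out in the same way, just with more cross terms.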

Problem 1.3 (Exam Style)¶

If you look at the form of the solution to the previous problem, the value of $e$ should make sense to you.

Explain why, and give a formula for the value of $e$ for the function

$$ K(\mathbf{x}, \boldsymbol{\xi}) = (\mathbf{x}^T\boldsymbol{\xi} + c)^k $$

where now $\mathbf{x}$ and $\boldsymbol{\xi}$ are in $\mathbb{R}^d$ rather than in $\mathbb{R}^3$ and $k$ is an arbitrary positive integer.

(This new function $K$ is also a valid SVM kernel.)

Problem 1.4¶

The Gaussian kernel is defined in equation (13) of the class notes on kernels.

Let $X$ be an $n\times d$ random matrix representing $n$ data points in $\mathbb{R}^d$. Write a function with header

def check_rbf_kernel(x, sigma=1.):

that takes a numpy array representing a matrix like $X$ as its first argument and an optional value for the parameter $\sigma$ of the kernel as its second argument. The function checks numerically whether the Gaussian kernel is a valid SVM kernel on $X$. Specifically, the code should print 'ok' if it is and 'not ok' otherwise.

Show your code and what it prints when $X$ is defined as follows:

In [1]:
import numpy as np

n, d = 20, 30
x = np.random.randn(n, d)

Use the default value for sigma when you run the function.
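
One possible approach is sketched below: a kernel is valid on $X$ exactly when its Gram matrix is symmetric and positive semidefinite, and the latter property can be checked through the matrix's eigenvalues. This minimal sketch assumes that the Gaussian kernel has the familiar form $K(\mathbf{x}, \boldsymbol{\xi}) = e^{-\|\mathbf{x} - \boldsymbol{\xi}\|^2 / (2\sigma^2)}$; check this against equation (13) of the notes and adjust the code if the definition there differs. The eigenvalue tolerance is an arbitrary choice, and numpy is assumed to be imported as np as in the cell above.

    def check_rbf_kernel(x, sigma=1.):
        # Gram matrix: k[i, j] = K(x_i, x_j) for all pairs of rows of x
        sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=2)
        k = np.exp(-sq_dist / (2. * sigma ** 2))
        # A valid kernel has a symmetric, positive semidefinite Gram matrix.
        # numpy.linalg.eigvalsh assumes symmetry, so check that first.
        if np.allclose(k, k.T) and np.linalg.eigvalsh(k).min() > -1e-10:
            print('ok')
        else:
            print('not ok')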

Part 2: The Representer Theorem¶

After a couple of theoretical questions, this part of the assignment revisits the ad_data training set we used in Homework 6. You may want to read its description there to refresh your memory. Let's load the dataset:

In [2]:
from urllib.request import urlretrieve
from os import path as osp
import pickle


def retrieve(file_name, semester='fall22', course='371', homework=8):
    # download file_name from the course web page unless a local copy exists
    if osp.exists(file_name):
        print('Using previously downloaded file {}'.format(file_name))
    else:
        fmt = 'https://www2.cs.duke.edu/courses/{}/compsci{}/homework/{}/{}'
        url = fmt.format(semester, course, homework, file_name)
        urlretrieve(url, file_name)
        print('Downloaded file {}'.format(file_name))
In [3]:
ad_data_file = 'ad.pickle'
retrieve(ad_data_file, homework=6)
with open(ad_data_file, 'rb') as file:
    ad_data = pickle.load(file)
Using previously downloaded file ad.pickle

Problem 2.1 (Exam Style)¶

Suppose that you train a Logistic Regression Classifier (LRC) with regularization, as defined in Problem 1.3 of Homework 6. Does the representer theorem hold for the LRC risk function $L_{\text{reg}}(\mathbf{v})$ defined in that problem? Explain briefly why or why not.

Problem 2.2 (Exam Style)¶

Suppose now that you train the LRC without regularization, that is, by minimizing the cross-entropy risk alone:

$$ L_T(\mathbf{v}) = - \frac{1}{N} \sum_{n=1}^N [y_n\log p_n + (1 - y_n)\log(1 - p_n)] $$

where

$$ p_n = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}_n - b}} \;. $$

Omitting regularization is OK if the data is not linearly separable.

Does the representer theorem hold for the cross-entropy risk $L_T(\mathbf{v})$?

If it does, explain briefly but clearly why and, if necessary, show how the proof of the representer theorem given in the class notes needs to be adapted to work for $L_T(\mathbf{v})$.

If the theorem does not hold, state which assumption(s) are violated and explain briefly but clearly where and how the proof of the theorem given in the class notes breaks down.

Problem 2.3¶

Let us now experiment with the ad_data retrieved earlier. The following function, adapted from Homework 7, evaluates the classifier h, whose name is in h_name (for example, 'LRC' or 'SVM'), on both the training set and the test set in the dictionary data.

In [4]:
def evaluate(h, data, h_name):
    # percent accuracy of the trained classifier h on the samples in s
    def accuracy(s):
        sx, sy = s['x'], s['y']
        return h.score(sx, sy) * 100

    train, test = data['train'], data['test']
    f = '{:s}:\n\ttraining accuracy is {:.2f} percent,' +\
        '\n\ttest accuracy is {:.2f} percent'
    print(f.format(h_name, accuracy(train), accuracy(test)))

If the representer theorem holds for the risk function used to train h, then the optimal value $\mathbf{w}^\ast$ of $\mathbf{w}$ can be written as

$$ \mathbf{w}^\ast = \sum_{n=0}^{N-1} \beta_n \mathbf{x}_n $$

(we use pythonic numbering of the data points $\mathbf{x}_0, \ldots, \mathbf{x}_{N-1}$ here).

Therefore, we can check if the theorem holds by first using numpy.linalg.lstsq to compute the coefficients $\beta_n$ from the data points $\mathbf{x}_n$ and the value $\mathbf{w}^\ast$ computed by the training algorithm; and then computing the residual

$$ r = \left\|\mathbf{w}^\ast - \sum_{n=0}^{N-1} \beta_n \mathbf{x}_n\right\| \;. $$

If the theorem holds, then the residual should be very close to zero (to within numerical rounding errors).

Write a function with header

def experiment(h, data, h_name):

whose arguments are similar to those of evaluate. The only difference is that h is an untrained classifier instance when passed to experiment and a trained instance when passed to evaluate. The function experiment does the following:

  • It uses sklearn.model_selection.GridSearchCV and h.fit to determine a regularization constant C for the classifier h (you may assume that any classifier we use has such a constant) and train h. A sketch of this step is given after this list. The values used for this cross-validation procedure are the values in the variable cs defined as follows:

      cs = np.logspace(-3, 2, 15)
    
    

    All other parameters to GridSearchCV are left to their default values.

  • Let h = h.best_estimator_ be the classifier found through cross-validation. The function experiment should print the following information for h, when available, in the form of understandable printouts, one line for each bullet below:

    • Training accuracy (use evaluate).
    • Test accuracy (also printed by evaluate in the same call as for the training accuracy).
    • Best value of the regularization constant $C$ found by cross-validation (this is h.C).
    • The number $n$ of training samples (not test samples) and their dimensionality $d$.
    • The number of linearly independent training samples (use numpy.linalg.matrix_rank).
    • The residual $r$ for the optimal vector $\mathbf{w}^\ast$ found by the training algorithm. This vector can be retrieved as w = h.coef_.flatten(). A very small residual here would signal that the representer theorem holds.
    • The residual $r$ for a random vector $\mathbf{w} =$ np.random.randn(len(w)). The point of printing this value is to show that not every vector can be written as a linear combination of the training data points.
  • The items in the first five bullets above (training accuracy; test accuracy; best C; $n$ and $d$; linearly independent training samples) are printed in all cases. The remaining ones can only be printed for linear classifiers, because other classifiers don't have a $\mathbf{w}$ parameter. Since you will run experiment also on a nonlinear classifier in a later problem, put the code that computes and prints these remaining items in an if statement:

      if hasattr(h, 'coef_'):
  • The function experiment returns h, w if the h.coef_ attribute is present. Otherwise it returns just h.
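
For the cross-validation step in the first bullet, here is a minimal sketch (it assumes the same 'x' and 'y' keys used in evaluate; everything else in experiment is left to you):

    from sklearn.model_selection import GridSearchCV

    cs = np.logspace(-3, 2, 15)
    search = GridSearchCV(h, {'C': cs})  # all other parameters at their defaults
    search.fit(data['train']['x'], data['train']['y'])
    h = search.best_estimator_           # retrained with the best C found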

Show your code and the result of the following call:

experiment(LogisticRegression(max_iter=1000), ad_data, 'LRC');

where LogisticRegression is imported from sklearn.linear_model.

Then answer the following questions:

  • Does the representer theorem hold, and how can you tell?
  • Why is it effectively impossible for the residual to be very close to zero for a random vector w?

Programming Notes¶

  • The function numpy.linalg.lstsq is supposed to return the residual $r$. However, the definition of this function has a thinko (a thinking "typo"): According to the documentation, when the rank of the system matrix is smaller than the number of columns or the matrix has fewer rows than columns, then the output residuals is an empty array.

    This makes no sense, since the residual is fully well defined even in those cases. However, this is what the function actually does. As a consequence, you cannot rely on the value returned in residuals, and you need to compute the residual yourself. You may want to define a helper function that does that (a sketch is given after these notes), since you will need to compute residuals in other problems later on.

  • The call of experiment above ignores the values that the function returns, and the semicolon at the end of the call prevents the returned values from printing. You will use the returned values in a later call of this function.
  • The function evaluate prints the h_name and a colon on a line by itself, as a header. It then prints training and test accuracy on two separate lines indented with a single tab (\t). Please start each of your own printed lines with a tab as well, so that all outputs are properly grouped under the header.
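
For instance, a helper along the following lines (a minimal sketch; the name residual and its interface are arbitrary choices) computes the coefficients with numpy.linalg.lstsq and then evaluates the residual directly from its definition:

    def residual(a, w):
        # The rows of a are data points; find beta such that a.T @ beta
        # approximates w in the least-squares sense, then return the norm
        # of the difference.
        beta = np.linalg.lstsq(a.T, w, rcond=None)[0]
        return np.linalg.norm(w - a.T @ beta)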

Problem 2.4¶

Run the call

svm, w_svm = experiment(SVC(kernel='linear'), ad_data, 'SVM')

where SVC is imported from sklearn.svm.

Then write code that prints out the following (one understandable line per bullet, starting with a tab; a partial sketch follows the list):

  • The number of support vectors (these are listed in svm.support_vectors_) and how many of these are linearly independent.
  • The residual $r$ obtained when instead of using all the training data points $\mathbf{x}_n$ you just use the support vectors.
  • The residual $r$ obtained for a random vector (as before) when using the support vectors alone.
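
As a minimal sketch for the first bullet (the two residual bullets follow the same pattern as in Problem 2.3, with the support vectors in place of the full training set):

    # how many support vectors there are, and how many are independent
    sv = svm.support_vectors_
    print('\t{} support vectors, {} linearly independent'
          .format(len(sv), np.linalg.matrix_rank(sv)))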

Finally, answer the following questions in four separate bullets:

  • Are the test accuracies of LRC and SVM similar for these experiments? What is the reason? Discrepancies smaller than one percent are negligible.
  • Do these classifiers overfit?
  • What is the minimum number of data points needed to represent $\mathbf{w}^\ast$ through the representer theorem for the LRC in this experiment?
  • What is the minimum number of data points needed to represent $\mathbf{w}^\ast$ through the representer theorem for the SVM in this experiment?

Part 3: Linear and Nonlinear SVMs¶

The first problem below examines the use of an RBF SVM (that is, an SVM that uses Gaussian kernels) on the ad_data. This experiment shows that fancier estimators are not necessarily better.

To visualize things, the subsequent problems contrast linear SVMs and RBF SVMs on synthetic data that live in $\mathbb{R}^2$. This data is retrieved next.

In [5]:
data_2d_file_name = 'data.pickle'
retrieve(data_2d_file_name)
with open(data_2d_file_name, 'rb') as file:
    data_2d = pickle.load(file)
Using previously downloaded file data.pickle

For visualization, the code below retrieves and imports a function show_classification that is adapted from one used in the sample solution for Homework 7.

In [6]:
show_file = 'show.py'
retrieve(show_file)
from show import show_classification
Using previously downloaded file show.py

A typical call of show_classification has the following format:

show_classification(linear_svm, data_2d['train'], 'training', 'linear SVM')

with hopefully obvious meaning of the arguments. The resulting diagram shows reference margins as dashed curves on either side of the decision boundary, which is drawn as one or more solid curves. The meaning of the data point colors and markers is as in Homework 7.

Problem 3.1 (Exam Style except for Running the Code)¶

Run

experiment(SVC(kernel='rbf'), ad_data, 'RBF SVM');

Then answer the following questions in two separate bullets:

  • How does the RBF SVM do in terms of test accuracy compared to the linear SVM?
  • Briefly justify differences or similarities in test performance between RBF SVM and linear SVM. Refer to both training and test accuracy in your explanation.

Again, differences that are smaller than one percent are negligible.

Problem 3.2 (Exam Style except for Running the Code)¶

Run

linear_svm, _ = experiment(SVC(kernel='linear'), data_2d, 'linear SVM')
show_classification(linear_svm, data_2d['train'], 'training', 'linear SVM')

and then, in a separate cell (for easier pagination), run

show_classification(linear_svm, data_2d['test'], 'test', 'linear SVM')

Then answer the following questions in six separate bullets:

  • Is the data separable? How can you tell from just accuracy values?
  • Does the classifier generalize well? How can you tell from just accuracy values?
  • Is the number of support vectors small relative to the size of the training set?
  • Is the reference margin wide relative to the support of the training data distribution?
  • Are the answers to the previous two questions related to each other? If so, how? Specifically, why is the reference margin as narrow or wide as it is?
  • Why does the representer-theorem representation hold also for a random vector for this problem?

Problem 3.3 (Exam Style except for Running the Code)¶

Run

rbf_svm = experiment(SVC(kernel='rbf'), data_2d, 'RBF SVM')
show_classification(rbf_svm, data_2d['train'], 'training', 'RBF SVM')
show_classification(rbf_svm, data_2d['test'], 'test', 'RBF SVM')

(again split across two cells).

Then answer the following questions in five separate bullets:

  • Does the RBF SVM perform better than the linear SVM in terms of test accuracy? What is the reason for the difference in performance between the two SVMs?
  • Does the RBF SVM generalize well? How can you tell from just accuracy values?
  • Is the number of support vectors for the RBF SVM smaller than for the linear SVM for this classification problem?
  • Is the reference margin for the RBF SVM narrower than for the linear SVM for this classification problem?
  • Are the answers to the previous two questions related to each other? If so, how?