When you submit your work, follow the instructions on the submission workflow page scrupulously for full credit.
Instead of handing out sample exams in preparation for the midterm and final, homework assignments include several problems that are marked Exam Style . These are problems whose format and contents are similar to those in the exams, and are graded like the other problems. Please use these problems as examples of what may come up in an exam.
As a further exercise in preparation for exams, you may want to think of other plausible problems with a similar style. If you collect these problems in a separate notebook as the course progresses, you will have a useful sample you can use to rehearse for the exams.
Is this assignment properly formatted in the PDF submission, and did you tell Gradescope accurately where each answer is?
This question requires no explicit answer from you, and no space is provided for an answer in the template file.
However, the rubric entry for this problem in Gradescope will be used to decuct points for poor formatting of the assignment (PDF is not readable, code or text is clipped, math is not typeset, ...) and for incorrect indications in Gradescope for the location of each answer.
Sorry we have to do this, but poor formatting and bad answer links in Gradescope make our grading much harder.
Anything that is stated in the class notes about SVMs and kernels can be assumed to be known and requires no proof for this part of the assignment.
Consider the following objects:
These equalities involve the first two samples out of $T$.
Either construct a function $\boldsymbol{\varphi}(\mathbf{x})$ from $\mathbb{R}^d$ to $\mathbb{R}^e$ for some suitable $e$ such that
$$ K(\mathbf{x}, \boldsymbol{\xi}) = \boldsymbol{\varphi}^T(\mathbf{x}) \boldsymbol{\varphi}(\boldsymbol{\xi}) $$is a valid SVM kernel on $T$, or explain why no such function $\boldsymbol{\varphi}$ exists.
Let $\mathbf{x}$ and $\boldsymbol{\xi}$ be vectors in $\mathbb{R}^3$ and consider the following function from $\mathbb{R}^3\times \mathbb{R}^3$ to $\mathbb{R}$:
$$ K(\mathbf{x}, \boldsymbol{\xi}) = (\mathbf{x}^T\boldsymbol{\xi} + c)^2 $$where $c$ is a real number.
Show that $K$ is a valid SVM kernel by giving a function $\boldsymbol{\varphi}(\mathbf{x})$ from $\mathbb{R}^d$ to $\mathbb{R}^e$ for some suitable $e$ such that
$$ K(\mathbf{x}, \boldsymbol{\xi}) = \boldsymbol{\varphi}^T(\mathbf{x}) \boldsymbol{\varphi}(\boldsymbol{\xi}) \;. $$Also state the value of $e$ explicitly.
Spell out the square in terms of the components $x_1, x_2, x_3$ of $\mathbf{x}$ and $\xi_1, \xi_2, \xi_3$ of $\boldsymbol{\xi}$ and be patient.
Looking at the form of the solution to the previous problem, the value of $e$ should make sense to you.
Explain why, and give a formula for the value of $e$ for the function
$$ K(\mathbf{x}, \boldsymbol{\xi}) = (\mathbf{x}^T\boldsymbol{\xi} + c)^k $$where now $\mathbf{x}$ and $\boldsymbol{\xi}$ are in $\mathbb{R}^d$ rather than in $\mathbb{R}^3$ and $k$ is an arbitrary positive integer.
(This new function $K$ is also a valid SVM kernel.)
The Gaussian kernel is defined in equation (13) of the class notes on kernels.
Let $X$ be an $n\times d$ random matrix representing $n$ data points in $\mathbb{R}^d$. Write a function with header
def check_rbf_kernel(x, sigma=1.):
that takes a numpy
array representing a matrix like $X$ as its first argument and an optional value for the parameter $\sigma$ of the kernel as its second argument. The function verifies numerically if the Gaussian kernel is a valid SVM kernel on $X$. Specifically, the code should print 'ok
' if it is and 'not ok'
otherwise.
Show your code and what it prints when $X$ is defined as follows:
import numpy as np
n, d = 20, 30
x = np.random.randn(n, d)
Use the default value for sigma
when you run the function.
After a couple of theoretical questions, this part of the assignment revisits the ad_data
training set we used in Homework 6. You may want to read its description there to refresh your memory. Let's load the dataset:
from urllib.request import urlretrieve
from os import path as osp
import pickle
def retrieve(file_name, semester='fall22', course='371', homework=8):
if osp.exists(file_name):
print('Using previously downloaded file {}'.format(file_name))
else:
fmt = 'https://www2.cs.duke.edu/courses/{}/compsci{}/homework/{}/{}'
url = fmt.format(semester, course, homework, file_name)
urlretrieve(url, file_name)
print('Downloaded file {}'.format(file_name))
ad_data_file = 'ad.pickle'
retrieve(ad_data_file, homework=6)
with open(ad_data_file, 'rb') as file:
ad_data = pickle.load(file)
Using previously downloaded file ad.pickle
Suppose that you train a Logistic Regression Classifier (LRC) with regularization, as defined in Problem 1.3 of Homework 6. Does the representer theorem hold for the LRC risk function $L_{\text{reg}}(\mathbf{v})$ defined in that problem? Explain briefly why or why not.
Suppose now that you train the LRC without regularization, that is, by minimizing the cross-entropy risk alone:
$$ L_T(\mathbf{v}) = - \frac{1}{N} \sum_{n=1}^N [y_n\log p_n + (1 - y_n)\log(1 - p_n)] $$where
$$ p_n = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}_n - b}} \;. $$Omitting regularization is OK if the data is not linearly separable.
Does the representer theorem hold for the cross-entropy risk $L_T(\mathbf{v})$?
If it does, explain briefly but clearly why and, if necessary, show how the proof of the representer theorem given in the class notes needs to be adapted to work for $L_T(\mathbf{v})$.
If the theorem does not hold, state which assumption(s) are violated and explain briefly but clearly where and how the proof of the theorem given in the class notes breaks down.
Let us now experiment with the ad_data
retrieved earlier. The following function, adapted from Homework 7, evaluates the classifier h
, whose name is in h_name
(we will set h_name
to either 'LRC'
or 'SVM'
), on both the training set and the test set in the dictionary data
.
def evaluate(h, data, h_name):
def accuracy(s):
sx, sy = s['x'], s['y']
return h.score(sx, sy) * 100
train, test = data['train'], data['test']
f = '{:s}:\n\ttraining accuracy is {:.2f} percent,' +\
'\n\ttest accuracy is {:.2f} percent'
print(f.format(h_name, accuracy(train), accuracy(test)))
If the representer theorem holds for the risk function used to train h
, then the optimal value $\mathbf{w}^\ast$ of $\mathbf{w}$ can be written as
(we use pythonic numbering of the data points $\mathbf{x}_0, \ldots, \mathbf{x}_{N-1}$ here).
Therefore, we can check if the theorem holds by first using numpy.linalg.lstsq
to compute the coefficients $\beta_n$ from the data points $\mathbf{x}_n$ and the value $\mathbf{w}^\ast$ computed by the training algorithm; and then computing the residual
If the theorem holds, then the residual should be very close to zero (to within numerical rounding errors).
Write a function with header
def experiment(h, data, h_name):
whose arguments are similar to those of evaluate
. The only difference is that h
is an untrained classifier instance when passed to experiment
and a trained instance when passed to evaluate
. The function experiment
does the following:
It uses sklearn.model_selection.GridSearchCV
and h.fit
to determine a regularization constant C
for the classifier h
(you may assume that any classifier we use has such a constant) and train h
. The values used for this cross-validation procedure are the values in the variable cs
defined as follows:
cs = np.logspace(-3, 2, 15)
All other parameters to GridSearchCV
are left to their default values.
Let h = h.best_estimator_
be the classifier found through cross-validation. The function experiment
should print the following information for h
, if present, in the form of understandable printouts, one line for each bullet below:
evaluate
).evaluate
in the same call as for the training accuracy).h.C
).numpy.linalg.rank
).w = h.coef_.flatten()
. A very small residual here would signal that the representer theorem holds.np.random.randn(len(w))
. The point of printing this value is to show that not every vector can be written as a linear combination of the training data points.The items in the first five bullets above (training accuracy; test accuracy; best C
; $n$ and $d$; linearly independent training samples) are printed in all cases. The remaining ones can only be printed for linear classifiers, because other classifiers don't have a $\mathbf{w}$ parameter. Since you will run experiment
also on a nonlinear classifier in a later problem, put the code that computes and prints these remaining items in an if
statement:
if hasattr(h, 'coef_'):
The function experiment
returns h, w
if the h.coef_
attribute is present. Otherwise it returns just h
.
Show your code ad the result of the following call:
experiment(LogisticRegression(max_iter=1000), ad_data, 'LRC');
where LogisticRegression
is imported from sklearn.linear_model
.
Then answer the following questions:
w
?The function numpy.linalg.lstsq
is supposed to return the residual $r$. However, the definition of this function has a thinko (a thinking "typo"): According to the documentation, when the rank of the system matrix is smaller than the number of columns or the matrix has fewer rows than columns, then the output residuals
is an empty array.
This makes no sense, since the residual is fully well defined even in those cases. However, this is what the function actually does. As a consequence, you cannot rely on the value returned in residuals
, and you need to compute the residual yourself. You may want to define a helper function that does that, since you will need to compute residuals in other problems later on.
experiment
above ignores the values that the function returns, and the semicolon at the end of the call prevents the returned values from printing. You will use the returned values in a later call of this function.evaluate
prints the h_name
and a colon on a line by itself, as a header. It then prints training and test accuracy on two separate lines indented with a single tab (\t
). Please start each of your own printed lines with a tab as well, so that all outputs are properly grouped under the header.Run the call
svm, w_svm = experiment(SVC(kernel='linear'), ad_data, 'SVM')
where SVC
is imported from sklearn.svm
.
Then write code that prints out the following (one understandable line per bullet, starting with a tab):
svm.support_vectors_
) and how many of these are linearly independent.Finally, answer the following questions in four separate bullets:
The first problem below examines the use of an RBF SVM (that is, an SVM that uses Gaussian kernels) on the ad_data
. This experiment shows that fancier estimators are not necessarily better.
To visualize things, the subsequent problems contrast linear SVMs and RBF SVMs on synthetic data that live in $\mathbb{R}^2$. This data is retrieved next.
data_2d_file_name = 'data.pickle'
retrieve(data_2d_file_name)
with open(data_2d_file_name, 'rb') as file:
data_2d = pickle.load(file)
Using previously downloaded file data.pickle
For visualization, the code below retrieves and imports a function show_classification
that is adapted from one used in the sample solution for Homework 7.
show_file = 'show.py'
retrieve(show_file)
from show import show_classification
Using previously downloaded file show.py
A typical call of show_classification
has the following format:
show_classification(linear_svm, data_2d['train'], 'training', 'linear SVM')
with hopefully obvious meaning of the arguments. The resulting diagram shows reference margins as dashed curves on either side of the decision boundary, which is shown as one or more solid curve. The meaning of the data point colors and markers is as in Homework 7.
Run
experiment(SVC(kernel='rbf'), ad_data, 'RBF SVM');
Then answer the following questions in two separate bullets:
Again, differences that are smaller than one percent are negligible.
Run
linear_svm, _ = experiment(SVC(kernel='linear'), data_2d, 'linear SVM')
show_classification(linear_svm, data_2d['train'], 'training', 'linear SVM')
and then, in a separate cell (for easier pagination), run
show_classification(linear_svm, data_2d['test'], 'test', 'linear SVM')
Then answer the following questions in six separate bullets:
Run
rbf_svm = experiment(SVC(kernel='rbf'), data_2d, 'RBF SVM')
show_classification(rbf_svm, data_2d['train'], 'training', 'RBF SVM')
show_classification(rbf_svm, data_2d['test'], 'test', 'RBF SVM')
(again split across two cells).
Then answer the following questions in five separate bullets: