When you submit your work, follow the instructions on the submission workflow page scrupulously for full credit.
Instead of handing out sample exams in preparation for the midterm and final, homework assignments include several problems that are marked Exam Style. These problems are similar in format and content to those on the exams, and they are graded like the other problems. Please use these problems as examples of what may come up in an exam.
As a further exercise in preparation for exams, you may want to think of other plausible problems with a similar style. If you collect these problems in a separate notebook as the course progresses, you will have a useful sample you can use to rehearse for the exams.
Is this assignment properly formatted in the PDF submission, and did you tell Gradescope accurately where each answer is?
This question requires no explicit answer from you, and no space is provided for an answer in the template file.
However, the rubric entry for this problem in Gradescope will be used to deduct points for poor formatting of the assignment (PDF is not readable, code or text is clipped, math is not typeset, ...) and for incorrect indications in Gradescope for the location of each answer.
Sorry we have to do this, but poor formatting and bad answer links in Gradescope make our grading much harder.
A linear Support Vector Machine (SVM) $h : X \rightarrow Y$ for a binary classification problem with label set $Y = \{-1, 1\}$ and one-dimensional data space $X = \mathbb{R}$ is trained on the toy training set
$$ T = \{(0, -1), (1, -1), (2, 1), (3, 1), (4, 1) \} $$
to determine the parameters $w, b$ of $h$.
The empirical risk for this problem has the form
$$ L_T(w, b) = \frac{1}{2}w^2 + C_0 H(w, b) $$
where $H(w, b)$ is the average hinge loss on the training set and $C_0$ is a given constant.
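For concreteness, here is a small sketch of how this risk can be evaluated numerically on $T$. The values of $w$, $b$, and $C_0$ in the example call are hypothetical and are not the answer to any of the questions below.

import numpy as np

T = [(0, -1), (1, -1), (2, 1), (3, 1), (4, 1)]

def hinge(y, alpha):
    # Hinge loss max(0, 1 - y * alpha) for a single sample.
    return max(0., 1. - y * alpha)

def empirical_risk(w, b, c_0):
    # H(w, b) is the average hinge loss over T; the activation is w * x + b.
    h = np.mean([hinge(y, w * x + b) for x, y in T])
    return 0.5 * w ** 2 + c_0 * h

print(empirical_risk(w=1., b=0., c_0=10.))  # hypothetical values only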
Explain in a brief sentence why for this specific training set $T$ there exist values for the parameters $w, b$ for which the average hinge loss $H(w, b)$ is zero.
Suppose that the constant $C_0$ is so large that the optimal solution $w^\ast, b^\ast$ that results from training $h$ on the given set $T$ achieves an average hinge loss $H(w^\ast, b^\ast)$ of zero. Give the numerical value of the reference margin $\mu^{\ast}$ of $h$ for this optimal solution.
[Hint: Where is the decision boundary with maximum reference margin, given that $H(w^\ast, b^\ast)=0$?]
Give the numerical values of the support vectors corresponding to the optimal solution $w^\ast, b^\ast$.
Give the numerical values of the optimal parameters $w^\ast, b^\ast$ of $h$, assuming that $C_0$ is as specified in problem 1.2 above.
Hint: How is $w^\ast$ related to $\mu^\ast$?
Assume that the five samples in $T$ are numbered from 0 to 4, in the order they are given above. Give the numerical values $\mu(x_0, y_0), \ldots, \mu(x_4, y_4)$ of the margin of each of these five samples for the optimal SVM parameters $w^\ast, b^\ast$ found above.
The main reason for the popularity of SVMs is that they can be adapted to problems with nonlinear decision boundaries. We will see how this is done when we study SVMs with kernels.
If we stick to the linear version of SVMs, there is typically little difference in practice between the performance of SVMs and Logistic Regression Classifiers (LRCs), although there are exceptions. We explore these points in this part of the assignment, limiting ourselves to binary classification problems for simplicity.
First, let us understand why binary LRCs and binary linear SVMs perform very similarly. For a fair comparison, we consider LRCs with regularization, which we addressed in part 1 of homework 6. You will recall that these LRCs minimize the risk
$$ L_{LRC}(\mathbf{v}) = C L_T^{(ce)}(\mathbf{v}) + \|\mathbf{v}\|^2 $$
where $\mathbf{v} = (\mathbf{w}, b)$ is the vector of activation parameters and $L_T^{(ce)}(\mathbf{v})$ is the cross-entropy risk.
On the other hand, SVMs minimize the following risk:
$$ L_{SVM}(\mathbf{v}) = C L_T^{(h)}(\mathbf{v}) + \|\mathbf{w}\|^2 $$
where $L_T^{(h)}$ is the hinge risk.
The regularization terms $\|\mathbf{v}\|^2$ and $\|\mathbf{w}\|^2$ in the two risk definitions are very similar, so we should expect them to have a similar effect: They both tend to make $\mathbf{w}$ small.
The cross-entropy risk $L_T^{(ce)}(\mathbf{v})$ is the average of the cross-entropy loss
$$ \ell_{ce}(y, \alpha) = -y\log f(\alpha) -(1-y)\log(1-f(\alpha)) $$
composed with the logistic regression scoring function
$$ f(\alpha) = \frac{1}{1 + e^{-\alpha}} $$
where $\alpha = \mathbf{w}^T\mathbf{x} + b$ is the activation.
The hinge risk $L_T^{(h)}(\mathbf{v})$ is the average of the hinge loss
$$ \ell_{h}(y, \alpha) = \max(0, 1 - y\alpha) $$
where $\alpha = \mathbf{w}^T\mathbf{x} + b$ is again the activation.
Write code to plot $\ell_{ce}(1, \alpha)$ and $\ell_{h}(1, \alpha)$ superimposed on the same diagram and for $\alpha\in[-3, 3]$.
Label the horizontal axis and add a legend that shows which curve is for which loss.
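As a starting point, here is a minimal sketch of one way to produce this plot. The variable names and figure settings are arbitrary, and you are free to format the figure differently.

import numpy as np
from matplotlib import pyplot as plt

a = np.linspace(-3, 3, 601)
# Cross-entropy loss for y = 1: -log f(a), with f the logistic function.
ce = -np.log(1 / (1 + np.exp(-a)))
# Hinge loss for y = 1: max(0, 1 - a).
h = np.maximum(0, 1 - a)

plt.figure(figsize=(6, 4), tight_layout=True)
plt.plot(a, ce, label='cross-entropy loss')
plt.plot(a, h, label='hinge loss')
plt.xlabel(r'$\alpha$')
plt.legend()
plt.show()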
Because the loss functions for LRCs and linear SVMs are so similar to each other, the two classifiers have very similar performance (in terms of classification accuracy) in most cases. To see this, let us consider a "traditional" pair of training and test data sets, retrieved as usual below.
from urllib.request import urlretrieve
from os import path as osp
import pickle
def retrieve(file_name, semester='fall22', course='371', homework=7):
    if osp.exists(file_name):
        print('Using previously downloaded file {}'.format(file_name))
    else:
        fmt = 'https://www2.cs.duke.edu/courses/{}/compsci{}/homework/{}/{}'
        url = fmt.format(semester, course, homework, file_name)
        urlretrieve(url, file_name)
        print('Downloaded file {}'.format(file_name))
balanced_file_name = 'balanced.pickle'
retrieve(balanced_file_name)
Using previously downloaded file balanced.pickle
with open(balanced_file_name, 'rb') as file:
    balanced_data = pickle.load(file)
As usual, the dataset in balanced_data is a dictionary with two keys, train and test, each of which corresponds to a value that is in turn a dictionary with keys x and y. If there are $n$ samples in one of these dictionaries, then x is a numpy array of shape $(n, 2)$ (that is, the data is in $\mathbb{R}^2$) and y is a numpy array of shape $(n,)$ with entries in $Y = \{-1, 1\}$. Here is a plot of the data:
from matplotlib import pyplot as plt
%matplotlib inline
import numpy as np

def show_data(data):
    plt.figure(figsize=(10, 5), tight_layout=True)
    fs = 12
    for plot, s in enumerate(('train', 'test')):
        plt.subplot(1, 2, plot + 1)
        x, y = data[s]['x'], data[s]['y']
        n = len(y)
        labels, colors = (-1, 1), ('r', 'b')
        for label, color in zip(labels, colors):
            k = y == label
            plt.plot(x[k, 0], x[k, 1], 'o' + color,
                     label='y = {}'.format(label), ms=3)
        plt.legend(fontsize=fs)
        plt.axis('equal')
        plt.axis('off')
        suffix = 'ing' if s == 'train' else ''
        plt.title('{}{} set'.format(s, suffix), fontsize=fs)
    plt.show()
show_data(balanced_data)
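If you want to double-check the format just described, a quick sanity check along these lines (a minimal sketch; it relies on the variables defined in the cells above) prints the array shapes and the label values:

for s in ('train', 'test'):
    x, y = balanced_data[s]['x'], balanced_data[s]['y']
    print(s, x.shape, y.shape, np.unique(y))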
This dataset is "traditional" in the sense that nothing in particular sticks out. The two classes are balanced, which means they have similar numbers of samples; areas in data space with positive and negative samples overlap somewhat; and the distribution of the test data is quite similar to that of the training data. The latter aspect is consistent with the fundamental assumption made in machine learning, namely, that all the data is drawn uniformly and at random out of the same (typically unknown) joint probability distribution $p(\mathbf{x}, y)$.
Write a function with header

def experiment(data_set, h, h_name):

that trains the classifier h on the training portion of data_set (which is a dictionary like balanced_data); h_name is either LRC or SVM, which is useful to draw a meaningful title for each figure.

Show your code and what it does as a result of the calls
experiment(balanced_data, LogisticRegression(), 'LRC')
experiment(balanced_data, SVC(kernel='linear'), 'SVM')
Each call should be in a separate notebook cell for easier pagination.
Then comment briefly on (i) what limits the accuracy of the classifiers in these two experiments; (ii) whether the classifiers generalize well; and (iii) how well the LRC and the SVM do relative to each other.
This problem (and the next) uses the default values for the scikit-learn implementations of the LRC and the SVM, except that kernel='linear' is specified for the SVM. While it would be interesting to explore what happens with different amounts of regularization (the constant $C_0$ in the class notes and the parameter C in the scikit-learn classifiers), this analysis would be too complex for a homework assignment.
In each figure, show the training data and the test data side by side, each in its own panel drawn with plt.subplot(1, 2, k). Distinguish the samples by color and by marker type (marker='o' or marker='v'). For the SVM, draw the support vectors with solid markers (fillstyle='full'); otherwise, all markers are hollow (fillstyle='none'). experiment should also print out how many support vectors were found.

Note that the specifications above envision four different combinations of color and marker type for the data samples (in addition to solid/hollow for the SVM): true positives, true negatives, false positives, and false negatives. Make sure you get these combinations right. Pay particular attention to the last two categories: For instance, false positives are samples that are falsely classified as positive. Thus, they are really negative, and should be colored red.
To color the decision regions, use matplotlib.pyplot.contourf with cmap=matplotlib.pyplot.cm.RdBu (the red-blue colormap) and with alpha=0.3. This value of alpha specifies a low color opacity, so that we can still see the data points if we draw them on top of the lightly colored regions.

To find the decision regions and the decision boundary, use the method h.decision_function from the trained classifier h. The decision boundary is where the decision function is zero, so you can plot it as a solid black line with plt.contour (not contourf) using levels=[0], colors='k', and linestyles=['-']. For the SVM, draw the margin lines similarly, but as dashed lines with linestyles=['--'] or, if you draw both lines at the same time, linestyles=['--', '--'].

The attribute h.support_ of the trained classifier is a list of indices to the support vectors in data_set['train']['x'] and data_set['train']['y']. The method h.score of the trained classifier computes accuracy in $[0, 1]$, so you just need to multiply the values it returns by 100 to get the accuracy as a percent.

It is advisable to organize your code with the aid of some helper functions. Below are the headers of the helper functions used in the sample solution. Specifically, show_split plots the data similarly to the function show_data defined earlier, but with different markers depending on the fate of a sample (true positive, true negative, false positive, false negative). The function show_classification calls show_split and then colors the decision regions and draws the decision boundary and, for the SVM, the margin lines. However, feel free to define your own helper functions instead of or in addition to these.
def show_split(h, samples, marker='none'):
def show_classification(h, data_set, data_set_type, description):
def evaluate(h, train, test, description):
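For orientation only, here is a rough sketch of how these pieces could fit together. It is not the sample solution: the helper signatures are simplified, the assignment of markers to the four sample categories is just one possible reading of the specifications above, and the solid markers for support vectors are omitted for brevity. The imports assume the scikit-learn classifiers used in the calls to experiment.

import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def show_split(h, samples, fill='none'):
    # Color by true label (red = negative, blue = positive); marker by whether
    # the prediction is correct ('o') or wrong ('v').
    x, y = samples['x'], samples['y']
    y_hat = h.predict(x)
    specs = (('true positive', 1, True, 'b', 'o'),
             ('true negative', -1, True, 'r', 'o'),
             ('false positive', -1, False, 'r', 'v'),
             ('false negative', 1, False, 'b', 'v'))
    for name, label, correct, color, marker in specs:
        k = (y == label) & ((y_hat == y) == correct)
        if np.any(k):
            plt.plot(x[k, 0], x[k, 1], marker=marker, color=color,
                     fillstyle=fill, linestyle='', ms=5, label=name)
    plt.legend(fontsize=8)
    plt.axis('equal')

def show_classification(h, data_set, data_set_type, description):
    samples = data_set[data_set_type]
    x = samples['x']
    # Evaluate the decision function on a grid to color the decision regions
    # and to draw the decision boundary (and, for the SVM, the margin lines).
    x0 = np.linspace(x[:, 0].min() - 1, x[:, 0].max() + 1, 200)
    x1 = np.linspace(x[:, 1].min() - 1, x[:, 1].max() + 1, 200)
    g0, g1 = np.meshgrid(x0, x1)
    act = h.decision_function(np.stack((g0.ravel(), g1.ravel()), axis=1))
    act = act.reshape(g0.shape)
    plt.contourf(g0, g1, (act > 0).astype(float), levels=[-0.5, 0.5, 1.5],
                 cmap=plt.cm.RdBu, alpha=0.3)
    plt.contour(g0, g1, act, levels=[0], colors='k', linestyles=['-'])
    if description == 'SVM':
        plt.contour(g0, g1, act, levels=[-1, 1], colors='k',
                    linestyles=['--', '--'])
    show_split(h, samples)
    plt.title('{}, {} set'.format(description, data_set_type), fontsize=10)

def evaluate(h, train, test, description):
    # h.score returns accuracy in [0, 1]; report it as a percent.
    for name, subset in (('training', train), ('test', test)):
        print('{} {} accuracy: {:.2f} percent'.format(
            description, name, 100 * h.score(subset['x'], subset['y'])))

def experiment(data_set, h, h_name):
    h.fit(data_set['train']['x'], data_set['train']['y'])
    evaluate(h, data_set['train'], data_set['test'], h_name)
    if hasattr(h, 'support_'):
        print('{} support vectors found'.format(len(h.support_)))
    plt.figure(figsize=(10, 5), tight_layout=True)
    for k, data_set_type in enumerate(('train', 'test')):
        plt.subplot(1, 2, k + 1)
        show_classification(h, data_set, data_set_type, h_name)
    plt.show()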
In the literature and online discussions you often read that the LRC and the linear SVM are effectively interchangeable in most applications that require a binary linear classifier. The experiments in the previous problem are an example of the situations where this belief comes from.
Even if this equivalence were always the case, SVMs become preferable when they are used with kernels. This combination makes SVMs very powerful predictors for classification problems with nonlinear decision boundaries.
In this problem, on the other hand, we explore a situation where SVMs may have an edge over LRCs even in the linear case, that is, without kernels. Since these two predictors perform similarly in "traditional" situations, we need to look at scenarios that stretch some of the basic assumptions made in machine learning.
One such scenario involves imbalanced data, in which one class has many fewer samples than the other. This is a very common situation. For instance, in most medical applications the "diseased" category has many fewer samples than the "healthy" category, since disease is fortunately less frequent than good health.
As an example, if you design a classifier that finds melanomas (skin cancer) by examining people's nevi (skin moles), you will easily find large numbers of moles, of which only a tiny fraction are cancerous. The training set is then imbalanced, and a test set that is drawn from the same distribution is similarly imbalanced.
The data set retrieved and displayed below is a toy example of an imbalanced set. It is an artificially created set, and has nothing to do with skin cancer.
imbalanced_file_name = 'imbalanced.pickle'
retrieve(imbalanced_file_name)
Using previously downloaded file imbalanced.pickle
with open(imbalanced_file_name, 'rb') as file:
    imbalanced_data = pickle.load(file)
show_data(imbalanced_data)
There are many fewer negatives (red, $y=-1$) than positives (blue, $y=1$). Two other features of this data are related to class imbalance. First, the data (both training and test) is separable. This is of course not always the case. However, with so few negative samples, the chance of data overlap decreases, especially in data spaces with high dimensionality where there is more "room" for the data points. In other words, separability is found more often when the data in either class (or both) is sparse than when both classes have many samples.
In addition, the distribution of the negative samples in the test set seems a bit different from that of the data in the training set. This is often the case when either data set is sparse, because a small number of samples cannot be a good representative for a complex distribution. In other words, it is conceivable that when drawing a small number of samples from a distribution multiple times one gets data whose empirical distribution is different, simply because the two small sets do not represent the full distribution well.
Run
experiment(imbalanced_data, LogisticRegression(), 'LRC')
experiment(imbalanced_data, SVC(kernel='linear'), 'SVM')
(in separate notebook cells) and then comment on and explain the results.
Specifically, first answer questions similar to those in the previous problem: (i) does anything limit the accuracy of the classifiers in these two experiments; (ii) do the classifiers generalize well; and (iii) how well do the LRC and the SVM do relative to each other.
Then explain in detail why one of the two classifiers does better than the other. Your explanation should refer to what you observe in the solutions, and the placement of the decision boundary in particular; how the classifiers are defined; what loss functions they use; and what properties the optimal predictors have as a consequence, as they pertain to the data in these experiments.
Remember that both classifiers use the default parameter values of their scikit-learn implementations. Thus, both classifiers are trained with regularization, while we studied LRCs without regularization in class. As mentioned earlier, LRCs with regularization were covered in homework 6. You need not explain why regularization leads to any discrepancy for the LRC.