When you submit your work, follow the instructions on the submission workflow page scrupulously for full credit.
Instead of handing out sample exams in preparation for the midterm and final, homework assignments include several problems that are marked Exam Style. These problems are similar in format and content to those on the exams, and they are graded like the other problems. Please use these problems as examples of what may come up in an exam.
As a further exercise in preparation for exams, you may want to think of other plausible problems with a similar style. If you collect these problems in a separate notebook as the course progresses, you will have a useful sample you can use to rehearse for the exams.
Is this assignment properly formatted in the PDF submission, and did you tell Gradescope accurately where each answer is?
This question requires no explicit answer from you, and no space is provided for an answer in the template file.
However, the rubric entry for this problem in Gradescope will be used to deduct points for poor formatting of the assignment (PDF is not readable, code or text is clipped, math is not typeset, ...) and for incorrect indications in Gradescope for the location of each answer.
Sorry we have to do this, but poor formatting and bad answer links in Gradescope make our grading much harder.
A linear Support Vector Machine (SVM) $h : X \rightarrow Y$ for a binary classification problem with label set $Y = \{-1, 1\}$ and one-dimensional data space $X = \mathbb{R}$ is trained on the toy training set
$$ T = \{(0, -1), (1, -1), (2, 1), (3, 1), (4, 1) \} $$
to determine the parameters $w, b$ of $h$.
The empirical risk for this problem has the form
$$ L_T(w, b) = \frac{1}{2}w^2 + C_0 H(w, b) $$
where $H(w, b)$ is the average hinge loss on the training set and $C_0$ is a given constant.
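For concreteness, here is a small sketch of how this risk can be evaluated numerically on $T$. The values of $w$, $b$, and $C_0$ in the example call are hypothetical and are not the answer to any of the questions below.

import numpy as np

T = [(0, -1), (1, -1), (2, 1), (3, 1), (4, 1)]

def hinge(y, alpha):
    # Hinge loss max(0, 1 - y * alpha) for a single sample.
    return max(0., 1. - y * alpha)

def empirical_risk(w, b, c_0):
    # H(w, b) is the average hinge loss over T; the activation is w * x + b.
    h = np.mean([hinge(y, w * x + b) for x, y in T])
    return 0.5 * w ** 2 + c_0 * h

print(empirical_risk(w=1., b=0., c_0=10.))  # hypothetical values only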
Explain in a brief sentence why for this specific training set $T$ there exist values for the parameters $w, b$ for which the average hinge loss $H(w, b)$ is zero.
Suppose that the constant $C_0$ is so large that the optimal solution $w^\ast, b^\ast$ that results from training $h$ on the given set $T$ achieves an average hinge loss $H(w^\ast, b^\ast)$ of zero. Give the numerical value of the reference margin $\mu^{\ast}$ of $h$ for this optimal solution.
[Hint: Where is the decision boundary with maximum reference margin, given that $H(w^\ast, b^\ast)=0$?]
Give the numerical values of the support vectors corresponding to the optimal solution $w^\ast, b^\ast$.
Give the numerical values of the optimal parameters $w^\ast, b^\ast$ of $h$, assuming that $C_0$ is as specified in problem 1.2 above.
Hint: How is $w^\ast$ related to $\mu^\ast$?
Assume that the five samples in $T$ are numbered from 0 to 4, in the order they are given above. Give the numerical values $\mu(x_0, y_0), \ldots, \mu(x_4, y_4)$ of the margin of each of these five samples for the optimal SVM parameters $w^\ast, b^\ast$ found above.
The main reason for the popularity of SVMs is that they can be adapted to problems with nonlinear decision boundaries. We will see how this is done when we study SVMs with kernels.
If we stick to the linear version of SVMs, there is typically little difference in practice between the performance of SVMs and Logistic Regression Classifiers (LRCs), although there are exceptions. We explore these points in this part of the assignment, limiting ourselves to binary classification problems for simplicity.
First, let us understand why binary LRCs and binary linear SVMs perform very similarly. For a fair comparison, we consider LRCs with regularization, which we addressed in part 1 of homework 6. You will recall that these LRCs minimize the risk
$$ L_{LRC}(\mathbf{v}) = C L_T^{(ce)}(\mathbf{v}) + \|\mathbf{v}\|^2 $$
where $\mathbf{v} = (\mathbf{w}, b)$ is the vector of activation parameters and $L_T^{(ce)}(\mathbf{v})$ is the cross-entropy risk.
On the other hand, SVMs minimize the following risk:
$$ L_{SVM}(\mathbf{v}) = C L_T^{(h)}(\mathbf{v}) + \|\mathbf{w}\|^2 $$
where $L_T^{(h)}$ is the hinge risk.
The regularization terms $\|\mathbf{v}\|^2$ and $\|\mathbf{w}\|^2$ in the two risk definitions are very similar, so we should expect them to have a similar effect: They both tend to make $\mathbf{w}$ small.
The cross-entropy risk $L_T^{(ce)}(\mathbf{v})$ is the average of the cross-entropy loss
$$ \ell_{ce}(y, \alpha) = -y\log f(\alpha) -(1-y)\log(1-f(\alpha)) $$
composed with the logistic regression scoring function
$$ f(\alpha) = \frac{1}{1 + e^{-\alpha}} $$
where $\alpha = \mathbf{w}^T\mathbf{x} + b$ is the activation.
The hinge risk $L_T^{(h)}(\mathbf{v})$ is the average of the hinge loss
$$ \ell_{h}(y, \alpha) = \max(0, 1 - y\alpha) $$
where $\alpha = \mathbf{w}^T\mathbf{x} + b$ is again the activation.
Write code to plot $\ell_{ce}(1, \alpha)$ and $\ell_{h}(1, \alpha)$ superimposed on the same diagram and for $\alpha\in[-3, 3]$.
Label the horizontal axis and add a legend that shows which curve is for which loss.
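As a starting point, here is a minimal sketch of one way to produce this plot. The variable names and figure settings are arbitrary, and you are free to format the figure differently.

import numpy as np
from matplotlib import pyplot as plt

a = np.linspace(-3, 3, 601)
# Cross-entropy loss for y = 1: -log f(a), with f the logistic function.
ce = -np.log(1 / (1 + np.exp(-a)))
# Hinge loss for y = 1: max(0, 1 - a).
h = np.maximum(0, 1 - a)

plt.figure(figsize=(6, 4), tight_layout=True)
plt.plot(a, ce, label='cross-entropy loss')
plt.plot(a, h, label='hinge loss')
plt.xlabel(r'$\alpha$')
plt.legend()
plt.show()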
Because the loss functions for LRCs and linear SVMs are so similar to each other, the two classifiers have very similar performance (in terms of classification accuracy) in most cases. To see this, let us consider a "traditional" pair of training and test data sets, retrieved as usual below.
from urllib.request import urlretrieve
from os import path as osp
import pickle
def retrieve(file_name, semester='fall22', course='371', homework=7):
    if osp.exists(file_name):
        print('Using previously downloaded file {}'.format(file_name))
    else:
        fmt = 'https://www2.cs.duke.edu/courses/{}/compsci{}/homework/{}/{}'
        url = fmt.format(semester, course, homework, file_name)
        urlretrieve(url, file_name)
        print('Downloaded file {}'.format(file_name))
balanced_file_name = 'balanced.pickle'
retrieve(balanced_file_name)
Using previously downloaded file balanced.pickle
with open(balanced_file_name, 'rb') as file:
    balanced_data = pickle.load(file)
As usual, the dataset in balanced_data is a dictionary with two keys, train and test, each of which corresponds to a value that is in turn a dictionary with keys x and y. If there are $n$ samples in one of these dictionaries, then x is a numpy array of shape $(n, 2)$ (that is, the data is in $\mathbb{R}^2$) and y is a numpy array of shape $(n,)$ with entries in $Y = \{-1, 1\}$. Here is a plot of the data:
from matplotlib import pyplot as plt
%matplotlib inline
import numpy as np

def show_data(data):
    plt.figure(figsize=(10, 5), tight_layout=True)
    fs = 12
    for plot, s in enumerate(('train', 'test')):
        plt.subplot(1, 2, plot + 1)
        x, y = data[s]['x'], data[s]['y']
        n = len(y)
        labels, colors = (-1, 1), ('r', 'b')
        for label, color in zip(labels, colors):
            k = y == label
            plt.plot(x[k, 0], x[k, 1], 'o' + color,
                     label='y = {}'.format(label), ms=3)
        plt.legend(fontsize=fs)
        plt.axis('equal')
        plt.axis('off')
        suffix = 'ing' if s == 'train' else ''
        plt.title('{}{} set'.format(s, suffix), fontsize=fs)
    plt.show()
show_data(balanced_data)
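If you want to double-check the format just described, a quick sanity check along these lines (a minimal sketch; it relies on the variables defined in the cells above) prints the array shapes and the label values:

for s in ('train', 'test'):
    x, y = balanced_data[s]['x'], balanced_data[s]['y']
    print(s, x.shape, y.shape, np.unique(y))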
This dataset is "traditional" in the sense that nothing in particular sticks out. The two classes are balanced, which means they have similar numbers of samples; areas in data space with positive and negative samples overlap somewhat; and the distribution of the test data is quite similar to that of the training data. The latter aspect is consistent with the fundamental assumption made in machine learning, namely, that all the data is drawn uniformly and at random out of the same (typically unknown) joint probability distribution $p(\mathbf{x}, y)$.
Write a function with header

def experiment(data_set, h, h_name):

that trains the classifier h on the training portion of data_set (which is a dictionary like balanced_data); h_name is either LRC or SVM, which is useful to draw a meaningful title for each figure.

Show your code and what it does as a result of the calls
experiment(balanced_data, LogisticRegression(), 'LRC')
experiment(balanced_data, SVC(kernel='linear'), 'SVM')
Each call should be in a separate notebook cell for easier pagination.
Then comment briefly on (i) what limits the accuracy of the classifiers in these two experiments; (ii) whether the classifiers generalize well; and (iii) how well the LRC and the SVM do relative to each other.
This problem (and the next) uses the default values for the scikit-learn implementations of the LRC and the SVM, except that kernel='linear' is specified for the SVM. While it would be interesting to explore what happens with different amounts of regularization (the constant $C_0$ in the class notes and the parameter C in the scikit-learn classifiers), this analysis would be too complex for a homework assignment.
In each figure, show the training data and the test data side by side, each in its own panel drawn with plt.subplot(1, 2, k). Distinguish the samples by color and by marker type (marker='o' or marker='v'). For the SVM, draw the support vectors with solid markers (fillstyle='full'); otherwise, all markers are hollow (fillstyle='none'). experiment should also print out how many support vectors were found.

Note that the specifications above envision four different combinations of color and marker type for the data samples (in addition to solid/hollow for the SVM): true positives, true negatives, false positives, and false negatives. Make sure you get these combinations right. Pay particular attention to the last two categories: For instance, false positives are samples that are falsely classified as positive. Thus, they are really negative, and should be colored red.
To color the decision regions, use matplotlib.pyplot.contourf with cmap=matplotlib.pyplot.cm.RdBu (the red-blue colormap) and with alpha=0.3. This value of alpha specifies a low color opacity, so that we can still see the data points if we draw them on top of the lightly colored regions.

To find the decision regions and the decision boundary, use the method h.decision_function from the trained classifier h. The decision boundary is where the decision function is zero, so you can plot it as a solid black line with plt.contour (not contourf) using levels=[0], colors='k', and linestyles=['-']. For the SVM, draw the margin lines similarly, but as dashed lines with linestyles=['--'] or, if you draw both lines at the same time, linestyles=['--', '--'].

The attribute h.support_ of the trained classifier is a list of indices to the support vectors in data_set['train']['x'] and data_set['train']['y']. The method h.score of the trained classifier computes accuracy in $[0, 1]$, so you just need to multiply the values it returns by 100 to get the accuracy as a percent.

It is advisable to organize your code with the aid of some helper functions. Below are the headers of the helper functions used in the sample solution. Specifically, show_split plots the data similarly to the function show_data defined earlier, but with different markers depending on the fate of a sample (true positive, true negative, false positive, false negative). The function show_classification calls show_split and then colors the decision regions and draws the decision boundary and, for the SVM, the margin lines. However, feel free to define your own helper functions instead of or in addition to these.
def show_split(h, samples, marker='none'):
def show_classification(h, data_set, data_set_type, description):
def evaluate(h, train, test, description):
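For orientation only, here is a rough sketch of how these pieces could fit together. It is not the sample solution: the helper signatures are simplified, the assignment of markers to the four sample categories is just one possible reading of the specifications above, and the solid markers for support vectors are omitted for brevity. The imports assume the scikit-learn classifiers used in the calls to experiment.

import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def show_split(h, samples, fill='none'):
    # Color by true label (red = negative, blue = positive); marker by whether
    # the prediction is correct ('o') or wrong ('v').
    x, y = samples['x'], samples['y']
    y_hat = h.predict(x)
    specs = (('true positive', 1, True, 'b', 'o'),
             ('true negative', -1, True, 'r', 'o'),
             ('false positive', -1, False, 'r', 'v'),
             ('false negative', 1, False, 'b', 'v'))
    for name, label, correct, color, marker in specs:
        k = (y == label) & ((y_hat == y) == correct)
        if np.any(k):
            plt.plot(x[k, 0], x[k, 1], marker=marker, color=color,
                     fillstyle=fill, linestyle='', ms=5, label=name)
    plt.legend(fontsize=8)
    plt.axis('equal')

def show_classification(h, data_set, data_set_type, description):
    samples = data_set[data_set_type]
    x = samples['x']
    # Evaluate the decision function on a grid to color the decision regions
    # and to draw the decision boundary (and, for the SVM, the margin lines).
    x0 = np.linspace(x[:, 0].min() - 1, x[:, 0].max() + 1, 200)
    x1 = np.linspace(x[:, 1].min() - 1, x[:, 1].max() + 1, 200)
    g0, g1 = np.meshgrid(x0, x1)
    act = h.decision_function(np.stack((g0.ravel(), g1.ravel()), axis=1))
    act = act.reshape(g0.shape)
    plt.contourf(g0, g1, (act > 0).astype(float), levels=[-0.5, 0.5, 1.5],
                 cmap=plt.cm.RdBu, alpha=0.3)
    plt.contour(g0, g1, act, levels=[0], colors='k', linestyles=['-'])
    if description == 'SVM':
        plt.contour(g0, g1, act, levels=[-1, 1], colors='k',
                    linestyles=['--', '--'])
    show_split(h, samples)
    plt.title('{}, {} set'.format(description, data_set_type), fontsize=10)

def evaluate(h, train, test, description):
    # h.score returns accuracy in [0, 1]; report it as a percent.
    for name, subset in (('training', train), ('test', test)):
        print('{} {} accuracy: {:.2f} percent'.format(
            description, name, 100 * h.score(subset['x'], subset['y'])))

def experiment(data_set, h, h_name):
    h.fit(data_set['train']['x'], data_set['train']['y'])
    evaluate(h, data_set['train'], data_set['test'], h_name)
    if hasattr(h, 'support_'):
        print('{} support vectors found'.format(len(h.support_)))
    plt.figure(figsize=(10, 5), tight_layout=True)
    for k, data_set_type in enumerate(('train', 'test')):
        plt.subplot(1, 2, k + 1)
        show_classification(h, data_set, data_set_type, h_name)
    plt.show()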
In the literature and online discussions you often read that the LRC and the linear SVM are effectively interchangeable in most applications that require a binary linear classifier. The experiments in the previous problem are an example of the situations where this belief comes from.
Even if this equivalence were always the case, SVMs become preferable when they are used with kernels. This combination makes SVMs very powerful predictors for classification problems with nonlinear decision boundaries.
In this problem, on the other hand, we explore a situation where SVMs may have an edge over LRCs even in the linear case, that is, without kernels. Since these two predictors perform similarly in "traditional" situations, we need to look at scenarios that stretch some of the basic assumptions made in machine learning.
One such scenario involves imbalanced data, in which one class has many fewer samples than the other. This is a very common situation. For instance, in most medical applications the "diseased" category has many fewer samples than the "healthy" category, since disease is fortunately less frequent than good health.
As an example, if you design a classifier that finds melanomas (skin cancer) by examining people's nevi (skin moles), you will easily find large numbers of moles, of which only a tiny fraction are cancerous. The training set is then imbalanced, and a test set that is drawn from the same distribution is similarly imbalanced.
The data set retrieved and displayed below is a toy example of an imbalanced set. It is an artificially created set, and has nothing to do with skin cancer.
imbalanced_file_name = 'imbalanced.pickle'
retrieve(imbalanced_file_name)
Using previously downloaded file imbalanced.pickle
with open(imbalanced_file_name, 'rb') as file:
    imbalanced_data = pickle.load(file)
show_data(imbalanced_data)
There are many fewer negatives (red, $y=-1$) than positives (blue, $y=1$). Two other features of this data are related to class imbalance. First, the data (both training and test) is separable. This is of course not always the case. However, with so few negative samples, the chance of data overlap decreases, especially in data spaces with high dimensionality where there is more "room" for the data points. In other words, separability is found more often when the data in either class (or both) is sparse than when both classes have many samples.
In addition, the distribution of the negative samples in the test set seems a bit different from that of the data in the training set. This is often the case when either data set is sparse, because a small number of samples cannot be a good representative for a complex distribution. In other words, it is conceivable that when drawing a small number of samples from a distribution multiple times one gets data whose empirical distribution is different, simply because the two small sets do not represent the full distribution well.
Run
experiment(imbalanced_data, LogisticRegression(), 'LRC')
experiment(imbalanced_data, SVC(kernel='linear'), 'SVM')
(in separate notebook cells) and then comment on and explain the results.
Specifically, first answer questions similar to those in the previous problem: (i) does anything limit the accuracy of the classifiers in these two experiments; (ii) do the classifiers generalize well; and (iii) how well do the LRC and the SVM do relative to each other.
Then explain in detail why one of the two classifiers does better than the other. Your explanation should refer to what you observe in the solutions, and the placement of the decision boundary in particular; how the classifiers are defined; what loss functions they use; and what properties the optimal predictors have as a consequence, as they pertain to the data in these experiments.
Remember that both classifiers use the default parameter values of their scikit-learn implementations. Thus, both classifiers are trained with regularization, while we studied LRCs without regularization in class. As mentioned earlier, LRCs with regularization were covered in homework 6. You need not explain why regularization leads to any discrepancy for the LRC.