COMPSCI 371 Homework 6¶

Homework Submission Workflow¶

When you submit your work, follow the instructions on the submission workflow page scrupulously for full credit.

Note on Exam-Style Problems¶

Instead of handing out sample exams in preparation for the midterm and final, homework assignments include several problems that are marked marked Exam Style . These are problems whose format and contents are similar to those in the exams, and are graded like the other problems. Please use these problems as examples of what may come up in an exam.

As a further exercise in preparation for exams, you may want to think of other plausible problems with a similar style. If you collect these problems in a separate notebook as the course progresses, you will have a useful sample you can use to rehearse for the exams.

Problem 0 (3 points)¶

Is this assignment properly formatted in the PDF submission, and did you tell Gradescope accurately where each answer is?

Note¶

This question requires no explicit answer from you, and no space is provided for an answer in the template file.

However, the rubric entry for this problem in Gradescope will be used to deduct points for poor formatting of the assignment (PDF is not readable, code or text is clipped, math is not typeset, ...) and for incorrect indications in Gradescope for the location of each answer.

Sorry we have to do this, but poor formatting and bad answer links in Gradescope make our grading much harder.

Part 1: Regularization in Logistic Regression Classification¶

A technique known as regularization was mentioned in class in connection with the Logistic Regression Classifier (LRC). We will look into this technique more, but this part of the assignment is a preview.

We use natural logarithms everywhere in this part.

A Numerical Technicality¶

We first address a technicality that has to do with the composition of the logistic function

$$ p = f(\alpha) = \frac{1}{1 + e^{-\alpha}} $$

and the cross-entropy loss

$$ \ell(y, p) = -y \log p - (1 - y) \log (1-p) $$

that is used when training the LRC.

Specifically, when $\alpha$ approaches either positive or negative infinity the score $p$ becomes very close to either 0 or 1, and either the logarithm of $p$ or that of $1-p$ then is close to being undefined. This causes numerical problems.

However, direct manipulation shows that the composition of the two functions can be simplified as follows:

$$ \ell(y, \alpha) = \log(1 + e^{-\alpha}) + (1 - y) \alpha \;. $$

In this expression, the argument to the logarithm is always greater than one, so the issue mentioned above does not arise.

Thus, it is preferable to use this last expression for the cross-entropy loss in any coding, rather than the explicit composition of $f$ and $\ell$.

Two Datasets¶

The following code retrieves a very small set scalar_data (training set only) and a small dataset ad_data (both training set and test set) adapted from a dataset in the UCI Machine Learning Repository.

In [1]:
from urllib.request import urlretrieve
from os import path as osp


def retrieve(file_name, semester='fall22', course='371', homework=6):
    if osp.exists(file_name):
        print('Using previously downloaded file {}'.format(file_name))
    else:
        fmt = 'https://www2.cs.duke.edu/courses/{}/compsci{}/homework/{}/{}'
        url = fmt.format(semester, course, homework, file_name)
        urlretrieve(url, file_name)
        print('Downloaded file {}'.format(file_name))
In [2]:
import pickle

scalar_data_file = 'non_separable.pickle'
retrieve(scalar_data_file)
with open(scalar_data_file, 'rb') as file:
    scalar_data = pickle.load(file)
    
ad_data_file = 'ad.pickle'
retrieve(ad_data_file)
with open(ad_data_file, 'rb') as file:
    ad_data = pickle.load(file)
Using previously downloaded file non_separable.pickle
Using previously downloaded file ad.pickle

The scalar_data set contains a nearly trivial training set for a binary classification problem with with one-dimensional data ($x\in\mathbb{R}^d$) and labels in $Y = \{0, 1\}$, as shown below.

The data is stored as a dictionary with keys x and y. The corresponding values are two numpy vectors of length 200.

The ad_data set will be described later.

In [3]:
from matplotlib import pyplot as plt
import numpy as np
%matplotlib inline
In [4]:
def plot_1d_data(dataset, font_size=14):
    plt.figure(figsize=(10, 3), tight_layout=True)
    plt.plot(dataset['x'], dataset['y'], '.', ms=10)
    plt.xlabel('x', fontsize=font_size)
    plt.ylabel('y', fontsize=font_size)
    plt.title('training set')
    plt.xticks(fontsize=font_size)
    plt.yticks(fontsize=font_size)
    plt.show()
In [5]:
plot_1d_data(scalar_data)

Problem 1.1¶

Pretend that we know that the optimal value of $b$ in the equation $b + wx = 0$ of the LRC separating hyperplane (this hyperplane is a single point) for scalar_data is $b=0$.

In the scalar-data case, $w$ determines only the width of the "soft step" and which half-space (either $x \leq 0$ or $x \geq 0$) is the decision region for label 1.

Plot the training risk of the LRC obtained on scalar_data with this value of $b$ and for $w$ ranging in $[-3, 6]$.

The purpose of this exercise is both to make sure you remember the definitions and to show that there is an optimal, non-trivial value of $w$.

Programming Notes¶

  • Write a function to plot the training risk for an arbitrary dataset with scalar data ( $x\in\mathbb{R}$) and binary labels. You will need this function again in a later problem.
  • Use numpy.semilogy for the plot, rather than numpy.plot. This will turn out to be useful in a later problem.
  • Label the axes meaningfully.

Problem 1.2¶

Now "spread" the data in scalar_data to make the set separable. Specifically, make a new dataset separable_data that copies scalar_data and then adds 10 to all of the $x$ values for samples where $y = 1$ and subtracts 10 from the remaining $x$ values.

Then plot the risk again for the same combinations of $b$ and $w$ values as in the previous problem.

Describe what happens and explain why you obtain this type of plot for separable data, but not necessarily for non-separable data.

Problem 1.3¶

The previous two problems show that a LRC can run into trouble during training if the training set is separable. When the data lives in more than one dimension, that is, when $\mathbf{x}\in\mathbb{R}^d$ with $d > 1$, there is an additional issue, which can be seen as an opportunity.

Specifically, there are many more degrees of freedom for the separating hyperplane when $d > 1$. The number of parameters is $d+1$ for a binary problem, and there are many ways in which the hyperplane can rotate (in addition to translating and changing its steepness) in $\mathbb{R}^d$. Different hyperplane orientations can lead to different degrees of generalization. It would be good to find ways to exploit this additional freedom.

This phenomenon is practically non-existent when $d=1$, when the number of degrees of freedom is very limited and the "orientation" of the hyperplane is fixed.

Regularization¶

To address both aspects (lack of convergence and freedom in choosing a good hyperplane orientation), a regularization term is added to the risk. Rather than just minimizing $L_T(\mathbf{v})$ as a function of the parameters $\mathbf{v} = (b, \mathbf{w})$, we minimize a linear combination of $L_T(\mathbf{v})$ and $\|\mathbf{v}\|^2$, the square of the Euclidean norm of $\mathbf{v}$. The LRC software in scikit-learn writes this combination as

$$ L_{\text{reg}}(\mathbf{v}) = C L_T(\mathbf{v}) + \|\mathbf{v}\|^2 $$

where $C$ is a parameter to be determined. So there is stronger and stronger regularization as $C\rightarrow 0$. Regularization cannot be turned off, but it can be made insignificant by selecting a very large $C$, say, 10,000.

Why Regularization Helps¶

It is obvious why regularization helps with the first issue, lack of convergence for separable data, since it prevents the parameters from growing indefinitely. It is less obvious why regularization helps with the second issue. Why would smaller parameters lead to better generalization?

A simple answer is that they really don't, and what helps instead is the combination of an additional hyper-parameter $C$ and a procedure, cross-validation, that helps pick a value of $C$ that matches the dataset well: If the parameter vector $\mathbf{v}$ is best left unconstrained, cross-validation will return a large value of $C$. If there happens to be a value of $\|\mathbf{v}\|^2$ that leads to better generalization, then cross-validation will come up with a value of $C$ that has the effect of returning a parameter vector $\mathbf{v}$ with the desired norm. So you can see regularization in this case as provding an additional degree of freedom (controlled by the parameter $C$) in the choice of a separator that generalizes well.

The Ad Dataset¶

We use the ad dataset retrieved earlier to explore this point. This dataset has dimensionality $d=1430$. The variable ad_data is a dictionary with keys 'train' and 'test', and the value for each key is a dictionary with keys 'x' and 'y' corresponding to arrays of shape $n\times 1430$ and $n$ respectively, for some $n$ (1179 training samples and 1180 test samples).

If you want to understand the meaning of this data, you can explore that on the web site that hosts the original data from which this set was adapted. The classification problem is again binary with labels in $Y = \{0, 1\}$, so the classifier has signature $\mathbb{R}^{1430}\rightarrow Y$. This is really all you need to understand for this problem.

Here is a simplified version of the function train_and_test from homework 5. As you see from the results, the LRC has quite good performance but overfits a bit, since the test accuracy is lower than the training accuracy by a few percentage points.

In [6]:
from sklearn.linear_model import LogisticRegression


def train_and_test(data, cs=10000, folds=None, max_iter=10000):
    h = LogisticRegression(C=cs, max_iter=max_iter)
    h.fit(data['train']['x'], data['train']['y'])
    for dataset in ('train', 'test'):
        accuracy = h.score(data[dataset]['x'], data[dataset]['y'])
        print('{} empirical_accuracy is {:.2f} percent'.format(dataset, accuracy * 100))
    return h
In [7]:
lrc = train_and_test(ad_data)
train empirical_accuracy is 99.41 percent
test empirical_accuracy is 94.83 percent

As you see, the optional parameter folds is not used in the function definition for now.

Your Task¶

Your job is to see if choosing a value for the regularization parameter $C$ by cross-validation can help, and then analyze any improvement for the ad_data.

Specifically, first add code (an if statement and a call to sklearn.LogisticRegressionCV) to train_and_test so that it does the following:

  • If cs is a single scalar, train_and_test uses no cross-validation, and uses the current call to LogisticRegression.
  • Otherwise, train_and_test expects cs to be a numpy vector of floating-point values. In that case, it uses cross-validation with sklearn.LogisticRegressionCV and the specified number of folds to determine the hyper-parameter $C$ in the expression for $L_{\text{reg}}(\mathbf{v})$ among the choices in cs.

Then write a function with header

def analyze_cv(h):

This function takes an LRC h trained with sklearn.LogisticRegressionCV and examines the following attributes:

best_c, c_values, scores = h.C_[0], h.Cs_, h.scores_[1]
best_index = np.where(c_values == best_c)[0][0]

Here,

  • best_c is the final value of $C$ found through cross-validation.
  • c_values is the same as the array cs passed as Cs to sklearn.LogisticRegressionCV.
  • scores is a numpy array with folds rows and len(c_values) columns that contains the validation accuracy for each value of $C$ in each fold.
  • best_index tells which column of scores (and entry of c_values) corresponds to best_c.

The function analyze_cv first prints out a message with the best value of $C$ that h found. It then uses matplotlib.pyplot.errorbar to plot the mean validation score for each value of $C$ over all folds. The error bar for each value is the standard deviation of the validations scores for that value over all folds.

The function analyze_cv then adds a (visible!) red dot at (best_c, m) where m is the mean validation score for best_c.

Show your code and the result of the following calls (uncommented):

In [8]:
# c_candidates = np.logspace(-3, 6, num=10)
# lrc = train_and_test(ad_data, c_candidates, folds=5)
# analyze_cv(lrc)

Finally, comment briefly on the results: Did cross-validation help?

Programming Notes¶

  • Read the manual entry for sklearn.LogisticRegressionCV to get a sense for how it works, and make sure your code imports that function.
  • When sklearn.LogisticRegressionCV is called with cv=None, it uses 5-fold cross-validation. Otherwise, it uses the specified number of folds. So you can just pass cv=folds to that function.
  • Pass max_iter=max_iter to sklearn.LogisticRegressionCV and always use the default value in train_and_test.
  • It is OK to leave scores as fractions of 1, rather than percentages.
  • Label the axes of your plot meaningfully.
  • Cross-validation is expensive, so be patient.

Part 2: The Bootstrap¶

The bootstrap is a resampling method somewhat analogous to cross-validation. In class we discussed this method in the context of hyper-parameter estimation for some predictor $h$. Given a set $T$ of $N$ training samples, iteration $k$ of the bootstrap creates a training bag $T_k$ by sampling $N$ items out of $T$ with replacement. The items that did not make it into $T_k$ form the validation set $V_k$ for that iteration. The method then computes an estimate of the prediction risk $L_k$ by training $h$ on $T_k$ and evaluating its risk on $V_k$. The output from the entire procedure is an estimate of the validation risk and a measure of uncertainty around it in the form of a standard deviation, as explained next.

Uncertainty around the Risk Estimate¶

A key advantage of resampling methods for machine learning is that they yield not just a risk estimate $\hat{L}$ (the sample average of the risks $L_k$), but also a measure of uncertainty $\hat{\sigma}_{\hat{L}}$ around the estimate (the sample standard deviation of the risks $L_k$). If $\hat{\sigma}_{\hat{L}}$ is too large (perhaps compared to $\hat{L}$ itself), we know that we are not using enough data.

Note the two hats in $\hat{\sigma}_{\hat{L}}$: The one on $L$ says that this is a measure of uncertainty about our estimate of the risk. The hat on $\sigma$ says that this measure is not the true uncertainty, but rather an estimate of it.

Why Bags?¶

The use of resampling with replacement makes $T_k$ a bag (i.e., a multiset), rather than a set. This choice is a bit odd, and was not justified in class: Why make $T_k$ a bag rather than a set, which would be obtained if data points were drawn out of $T$ without replacement?

This part of the assignment shows empirically that resampling without replacement would lead to a poor estimate of the uncertainty $\hat{\sigma}_{\hat{L}}$. Of course, this could also be proven theoretically. See for instance the seminal book by B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap, Chapman and Hall, 1993.

A Simplified Scenario¶

Estimating validation risks is a relatively complex application of the bootstrap. To distil the essence of the method we consider a simpler estimation scenario, in which we only have bags $T_k$ and no sets $V_k$.

Specifically, suppose that you work for a company and you are asked to determine the average wait time for people who call your company's customer service center. You measure the wait time for $N$ calls and store the times $t_n$ in seconds, for $n=1,\ldots, N$ in a set $T$.

The numpy vector times below is that set, with $N=100$.

In [9]:
times_file = 'times.pickle'
retrieve(times_file)
with open(times_file, 'rb') as file:
    times = pickle.load(file)
n = len(times)
Using previously downloaded file times.pickle

You remember from your statistics course that the sample average $$ \hat{m}_N = \frac{1}{N} \sum_{n=1}^N t_n $$ from a random sample is an unbiased estimate of the true mean $m$ of the underlying population (the set of all waiting times).

Note on Terminology¶

In machine learning, we used the word "sample" to refer to a single item $(\mathbf{x}, y)$ out of a set of such items (for instance, a training set). In statistics, onthe other hand, the word "sample" is short for "random sample" and typically refers to a set of items drawn out of a distribution. The set $T$ in this part is a "(random) sample" in this sense of the term. We will use this terminology in this part of the assignment.

In [10]:
m_hat = np.mean(times)
print('estimated wait time {:.4f} seconds'.format(m_hat))
estimated wait time 13.1083 seconds

So you go to your boss and tell her that you determined that the average wait time for your company's customers is 13.1083 seconds.

She looks a bit puzzled: "Are you sure? And is your estimate so good that it deserves four decimal digits?"

Problem 2.1¶

To answer your boss's last question, use the bootstrap to give both an estimate $\hat{m}_{\hat{m}_b}$ of the mean wait time and an estimate $\hat{\sigma}_{\hat{m}_b}$ of the standard deviation of $\hat{m}_b$.

Specifically, write a function with header

def bootstrap(data, sample_size=None, iterations=10000, replace=True):

that takes the following inputs:

  • A numpy array data of floating point numbers (such as times defined above).
  • An optional integer sample_size that tells how large a sample to draw out of data. If sample_size is None, your function should set sample_size to len(data).
  • An optional number of bootstrap iterations. Always use the default value in your experiments.
  • An optional flag replace that is True if and only if samples are to be drawn out of data with replacement.

The function returns the sample mean and sample standard deviation of the sample means $\hat{m}_k$ computed on bags $T_k$ drawn out of data, for $k = 0,\ldots,$ iterations$ - 1$. All bags have bag cardinality sample_size, and whether items are drawn with or without replacement depends on the value of replace.

Show your code and print out the two outputs from

m_b, std_b = bootstrap(times)

using four decimal digits after the period. Your printout should show clearly which value is which.

Programming Notes¶

  • The function numpy.random.choice draws out of a set with replacement if its optional parameter replace is set to True (which is the default value). Make sure you pass replace=replace to this function.
  • Note that there are two sample means here. First, compute the sample mean of the values in each $T_k$. Then, compute the sample mean (and the sample standard deviation) of all these sample means.

Problem 2.2 (Exam Style)¶

How many decimal digits will you use next time you report your estimate of the customer wait time to your boss, and why?

Problem 2.3¶

As you can imagine, the data in times for this problem was not collected from actual customer phone calls, but was drawn from an exponential distribution with mean $\lambda = 12.5$ seconds instead. This means that the probability density from which the sample was drawn is

$$ p(t) = \left\{\begin{array}{ll} \frac{e^{-t/\lambda}}{\lambda} & \mbox{for $t\geq 0$}\\ 0 & \mbox{otherwise} \end{array}\right. $$

This density has mean $m = \lambda$ and standard deviation $\sigma = \lambda$. Of course, these are true statistical moments, not empirical ones: They come from knowledge of the true underlying distribution, not from measurements on a sample.

From statistics, we know that the average $\hat{m}_N$ of a random sample of $N$ numbers is unbiased, as mentioned earlier, so the true mean of $\hat{m}_N$ is

$$ m_{\hat{m}_N} = m $$

where $m$ is the true (statistical) mean of the distribution $p(t)$.

We also know from statistics that the true standard deviation of such a sample is

$$ \sigma_{\hat{m}_N} = \frac{\sigma}{\sqrt{N}} $$

where $\sigma$ is the true (statistical) standard deviation of the distribution $p(t)$.

Note: It is actually rather straightforward to derive these two results by direct calculation.

Note carefully that there are two true statistical means at play here: One is $m$, the true mean of the distribution $p(t)$ from which the sample was drawn. The other is $m_{\hat{m}_N}$, the true mean of a sample of $N$ items out of that distribution.

The mean $m$ answers the following question: If I draw a single item out of $p(t)$, what is its value on average?

The mean $m_{\hat{m}_N}$ answers the following question: If I draw N items out of $p(t)$ and compute their empirical mean $\hat{m}_N$, what is the value of $\hat{m}_N$ on average?

These two true means just happen to be numerically equal, because both sample averaging and statistical expectation are linear operations, so the statistical mean of a random variable is equal to the statistical mean of an empirical average of random variables drawn from the same distribution.

Your Task¶

Print out the following three numbers on three separate lines, using four decimal digits after the period:

  • The true mean $m$ of $p(t)$
  • The empirical average $\hat{m}_T$ of the entire set times
  • The empirical average m_b $ = \hat{m}_{\hat{m}_b}$ computed by bootstrap.

Your printout should make clear which number is which.

Problem 2.4 (Exam Style)¶

Two of the numbers you printed in the previous problem are more similar to each other than to the third. Explain why the two similar numbers are similar to each other and why they are (slightly) different from the third.

Problem 2.5¶

The standard deviation (statistical or empirical) is not a linear operation, so we should not expect the true standard deviations $\sigma$ and $\sigma_{\hat{m}_N}$ to be the same, nor should we expect their empirical estimates to be the same. Similarly to what we saw for the means, we can make the following considerations.

  • The true statistical standard deviation $\sigma$ of $p(t)$ answers the following question: If I repeatedly draw a single item out of $p(t)$, how much does its value vary across repetitions on average?

  • The true statistical standard deviation $\sigma_{\hat{m}_N}$ answers the following question: If I repeatedly draw bags of N items out of $p(t)$ and compute the empirical mean $\hat{m}_N$ of each bag, how much does the value of $\hat{m}_N$ vary across repetitions on average?

Thus, the two standard deviations measure different things.

Print out the following four numbers on four separate lines, using four decimal digits after the period:

  • The true standard deviation $\sigma$ of $p(t)$.
  • The empirical standard deviation $\hat{\sigma}_{\hat{m}_T}$ of the values in the entire set times.
  • The true standard deviation $\sigma_{\hat{m}_N}$ of a sample of N numbers out of $p(t)$.
  • The empirical standard deviation std_b $ = \hat{\sigma}_{\hat{m}_N}$ computed by bootstrap. Your printout should make clear which number is which.

Problem 2.6¶

From the formula, the true standard deviation $\sigma_{\hat{m}_N}$ is expected to decrease with increasing $N$. Verify that so does its bootstrap estimate by plotting both $\sigma_{\hat{m}_k}$ and $\hat{\sigma}_{\hat{m}_k}$ (the latter from running bootstrap) on the first $k$ items in times and for $k = 10, 20, \ldots, 100$.

Programming Notes¶

  • Wrap the plotting part of your code in a function, so you can reuse it later. The function takes three vector arguments with the values of $k$, those of $\sigma_{\hat{m}_k}$, and those of $\hat{\sigma}_{\hat{m}_k}$ respectively.
  • Use the default value for the parameter iterations of bootstrap.
  • Pass replace=replace to bootstrap.
  • Your plot should be a single diagram with two curves and a legend that says which curve is for which standard deviation.

Problem 2.7 (Exam Style Except for the Code)¶

Repeat the experiment in the previous problem calling bootstrap as follows for each $k$:

bootstrap(times, sample_size=k, replace=False)

In this way, instead of drawing $k$ items with replacement out of times to obtain a bag $T_k$ of size $N$ (same size as times), bootstrap draws $k$ items without replacement, so that $T_k$ is now a proper set of $T$ of cardinality $k$.

Show your plot, describe what you see, and try to explain why things look worse. In particular, why does this version of the bootstrap underestimate the true values more and more for increasing $k$?