When you submit your work, follow the instructions on the submission workflow page scrupulously for full credit.
Instead of handing out sample exams in preparation for the midterm and final, homework assignments include several problems that are marked Exam Style. These problems are similar in format and content to those in the exams, and are graded like the other problems. Please use them as examples of what may come up in an exam.
As a further exercise in preparation for exams, you may want to think of other plausible problems with a similar style. If you collect these problems in a separate notebook as the course progresses, you will have a useful sample you can use to rehearse for the exams.
Is this assignment properly formatted in the PDF submission, and did you tell Gradescope accurately where each answer is?
This question requires no explicit answer from you, and no space is provided for an answer in the template file.
However, the rubric entry for this problem in Gradescope will be used to deduct points for poor formatting of the assignment (PDF is not readable, code or text is clipped, math is not typeset, ...) and for incorrect indications in Gradescope of the location of each answer.
Sorry we have to do this, but poor formatting and bad answer links in Gradescope make our grading much harder.
By "convex" we mean "convex everywhere" in this part of the assignment.
Define two functions as follows:
- a weakly convex function $f : \mathbb{R}^d \rightarrow \mathbb{R}$;
- a weakly convex and nondecreasing function $g : \mathbb{R} \rightarrow \mathbb{R}$.
Define the new function $h : \mathbb{R}^d \rightarrow \mathbb{R}$ as the composition of $f$ and $g$:
$$ z = h(\mathbf{x}) = g(y) \;\;\;\text{where}\;\;\; y = f(\mathbf{x})\;. $$
We also use the affine function
$$ \phi(\mathbf{x}) = b + \mathbf{w}^T \mathbf{x} $$
from $\mathbb{R}^d$ to $\mathbb{R}$ in this part of the assignment.
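For a concrete instance of the composition $h$ above (our example, not part of the problem): with $f(\mathbf{x}) = \|\mathbf{x}\|$ and $g(y) = e^y$, both weakly convex and with $g$ nondecreasing, we get
$$ h(\mathbf{x}) = g(f(\mathbf{x})) = e^{\|\mathbf{x}\|}\;. $$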
Is the function $\phi(\mathbf{x})$ given above weakly convex? Prove your answer.
Is the function $\phi$ defined in the preamble above strongly convex? Weakly concave? Just answer each question. No proof required.
Give a simple example that shows that if we remove the requirement that $g$ is nondecreasing then the function $h$ constructed in the preamble above is not necessarily even weakly convex.
Start with the simplest weakly convex function and the simplest strongly convex function you can think of. Can you see ways to tweak and compose these two functions that will make the composition nonconvex (even concave)?
Prove that the function $h$ constructed in the preamble above is weakly convex without assuming that either $f$ or $g$ is differentiable.
In this part we explore basic properties of the combination of score function and loss function for the binary Logistic-Regression Classifier (LRC) with label set $Y = \{0, 1\}$.
The data score function is
$$ s(\mathbf{x}) = f(a(\mathbf{x})) $$
where
$$ a(\mathbf{x}) = b + \mathbf{w}^T \mathbf{x} $$
is the activation function and $f$ is some function of the activation. In these expressions, the data point $\mathbf{x}$ is a vector in $\mathbb{R}^d$ for some fixed positive integer $d$.
The classifier is then
$$ h(\mathbf{x}) = \left\{\begin{array}{ll} 1 & \mbox{if $s(\mathbf{x}) > 1/2$} \\ 0 & \mbox{otherwise.} \end{array}\right. $$
This function depends on the parameters $\mathbf{v} = (b, \mathbf{w})$ of the activation function $a(\mathbf{x})$.
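The following sketch shows this construction in code, with a logistic activation score function picked here purely for illustration (the choices of $f$ actually used in this part come later):

import numpy as np

def make_classifier(f, b, w):
    def s(x):                          # data score s(x) = f(a(x))
        return f(b + np.dot(w, x))     # activation a(x) = b + w^T x
    def h(x):                          # classifier: threshold the score at 1/2
        return 1 if s(x) > 0.5 else 0
    return s, h

f = lambda alpha: 1 / (1 + np.exp(-alpha))    # logistic function, for illustration
s, h = make_classifier(f, b=0.1, w=np.array([1.0, -2.0]))
x = np.array([0.5, 0.2])
print(s(x), h(x))                      # score in (0, 1) and the predicted label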
In the literature and in the class notes, the term "score function" is sometimes used to denote two different functions:
- the function $f$, which maps the activation to the score;
- the function $s$, which maps the data point directly to the score.
The same name is used for these two functions because they return the same value, namely the score for $\mathbf{x}$. However, the signatures of the two functions are different. The function $f$ takes the activation value $\alpha = a(\mathbf{x})$ (a real number) as input while the function $s$ directly takes the data point $\mathbf{x}$ (a vector of $d$ numbers) as input.
To avoid confusion between the two functions in this assignment we use the term data score function for $s$ and activation score function for $f$. Thus, the data score function is the composition of the activation function with the activation score function.
The classifier $h$ is trained on the training set
$$ T = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\} $$
by solving the following optimization problem:
$$ \hat{b}, \hat{\mathbf{w}} \in \arg\min_{b, \mathbf{w}} L_T(b, \mathbf{w}) $$
where the risk $L_T$ is defined as follows:
$$ L_T(b, \mathbf{w}) = \frac{1}{N} \sum_{n=1}^N \ell(y_n, s(\mathbf{x}_n))\;. $$
In the last expression, $\ell(y, p)$ is called the score loss function when viewed as a function of true label $y$ and data score $p = s(\mathbf{x})$.
For clarity, we call the composition
$$ \ell(y, s(\mathbf{x})) $$
the sample loss function in this assignment. Thus, the score loss function is a function of two real-valued variables while the sample loss function is defined on $Y\times\mathbb{R}^d$.
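In code, the risk translates directly from the formula above; in this sketch, ell and s are placeholders for a score loss function and a data score function:

def risk(ell, s, T):
    # Empirical risk: average sample loss over T = [(x_1, y_1), ..., (x_N, y_N)]
    return sum(ell(y, s(x)) for x, y in T) / len(T)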
In this part we consider two choices for a possible activation score function $f$:
- the identity function $f_1(\alpha) = \alpha$;
- the logistic function $f_2(\alpha) = \dfrac{1}{1 + e^{-\alpha}}$.
We saw in class that the identity score function is not a great idea. See Figure 2 of the notes on linear predictors, part 1. In that figure, $d = 1$ (that is, the data space is one-dimensional) and the data score $s(x)$ is an affine function of $x$, that is, $s(x) = b + wx$. In the notation used in this assignment, the activation is $\alpha = a(x) = b + wx$, and therefore the activation score function is $f(\alpha) = \alpha$, the identity function.
We then preferred the logistic function for a variety of reasons. This part explores convexity, or lack thereof, of the loss/score combination as one of those reasons.
We also consider two choices for a possible score loss function $\ell(y, p)$, where $p = s(\mathbf{x})$ is the score for data point $\mathbf{x}$:
- the quadratic loss $\ell_1(y, p) = (p - y)^2$;
- the cross-entropy loss
$$ \ell_2(y, p) = \left\{\begin{array}{ll} -\log p & \mbox{if $y = 1$} \\ -\log(1-p) & \mbox{if $y = 0$} \end{array}\right. $$
which can be written more compactly as follows:
$$ \ell_2(y, p) = -y\log p - (1 - y) \log(1-p) \;. $$
We use natural logarithms in this assignment.
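The compact form translates directly to code; here is a sketch (np.log is the natural logarithm, as required):

import numpy as np

def ell2(y, p):
    # Cross-entropy: reduces to -log(p) when y = 1 and -log(1 - p) when y = 0
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

print(ell2(1, 0.9), ell2(0, 0.9))  # small loss when confident and correct, large otherwise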
For each of the functions $f_1$, $f_2$, $\ell_1$, $\ell_2$ defined above state whether the function is convex everywhere, concave everywhere, or neither; and whether it is increasing everywhere, decreasing everywhere, or neither.
For $f_1$ and $f_2$ it is obvious what this means, as these functions depend on a single scalar variable $\alpha$. For $\ell_1$ and $\ell_2$ consider these as functions of $p$. Considering these as functions of $y$ would make little sense, since $y$ is a label, and therefore a categorical value. On the other hand, $p$ is a score value, which is a real number.
It is possible for a score loss function to have one property for $y=0$ and another property for $y=1$, so you should give an answer for each value of $y$ in that case.
Be as specific as you can in your answers. For instance, if a function is strongly convex it is also weakly convex, but we want "strongly convex" as an answer, since this is more specific than "weakly convex." Similarly for "decreasing" (more specific) versus "non-increasing" (less specific) and the like. If a function is both weakly convex and weakly concave, so state.
No proofs are needed, and you may leave the word "everywhere" implicit in your answers.
Given the choices above, there are up to four possible combinations of activation score function and score loss function. For each combination give a single formula for the corresponding sample loss function in terms of $y$ and $\alpha$. To keep the expressions simpler, do not expand $\alpha$ to $b + \mathbf{w}^T\mathbf{x}$.
If a combination makes no sense mathematically (in that the two functions cannot be combined by composition), write "undefined" rather than the formula. Each of your formulas should be written in the form
$$ \ell_i(y, f_j(\alpha)) = \cdots $$
where $i\in\{1, 2\}$ and $j\in\{1, 2\}$.
Make your formulas as simple as you can notationally.
In addition, for each of the combinations that are not undefined, state if the combined function is globally convex in $\alpha$ or not. No need to prove your statement, but make sure you know why you give the answers you give.
Plot the sample loss functions from problem 2.2 that are not undefined, each in its own notebook cell. The independent variable is $\alpha$ in $[-5, 5]$. Each plot should have a curve for $y=0$ and one for $y=1$.
Put a legend that states which curve is which ($y=0$ or $y=1$) and a title that states which sample loss function you are plotting. Axis labels are optional.
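A sketch of the plotting mechanics follows; the stand-in loss here is arbitrary, and you should substitute your own combinations from problem 2.2:

import numpy as np
import matplotlib.pyplot as plt

alpha = np.linspace(-5, 5, 201)
loss = lambda y, a: np.abs(y - np.tanh(a))   # arbitrary stand-in, for illustration only

plt.figure()
for y in (0, 1):
    plt.plot(alpha, loss(y, alpha), label='y = {}'.format(y))
plt.legend()
plt.title('stand-in loss (replace with your own)')
plt.show()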
Consider the plots in the previous problem for $y = 0$. If correct, these plots should suggest that the sample loss functions they depict behave rather differently as $\alpha\rightarrow\infty$. All these functions asymptote to polynomials; that is, they can be approximated by $\alpha^k$ for some $k \geq 0$ as $\alpha\rightarrow\infty$. Give the value of $k$ for each of the combinations that are not undefined in problem 2.2.
Specify which value of $k$ is for which sample loss function by giving your answers in the form
$$ \ell_i(y, f_j(\alpha)) \rightarrow \alpha^k $$
for appropriate values of $i, j, k$. Then state which function is most or least sensitive to values that are classified correctly and are very far from the decision boundary. Assume that the form of the activation $a(\mathbf{x})$ is the same for all samples. No explanation needed.
One of the functions in problem 2.2 is sensitive to samples far from the boundary regardless of whether they are correctly or incorrectly classified. Which one, and how do you know?
The class sklearn.linear_model.LogisticRegression implements a Multiclass Logistic-Regression Classifier (MLRC). Please read the documentation of this class online. We will use it with default values for all parameters, with the lone exception of multi_class='multinomial'. So you instantiate the classifier with the call

h = LogisticRegression(multi_class='multinomial')

and then train it on training set x, y with

h.fit(x, y)
If the training set has $n$ samples with $d$ features each, and the labels take $k$ distinct values, then x is an $n$ by $d$ numpy array of floats and y is a numpy vector with $n$ integers between $0$ and $k-1$.
Of the various attributes of LogisticRegression you will only need score, so please read the documentation for that.

Important: The attribute score is not the same as the score function we talked about in class. That would be the attribute predict_proba, which we however will not need for this assignment. Instead, score takes a dataset and returns the classifier's accuracy on that dataset. The accuracy is the fraction of samples in the given dataset that the classifier predicts correctly. It is returned as a number between 0 and 1.
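As a minimal sketch of these calls (the toy arrays below are made up for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: 6 samples, 2 features, 3 classes (labels 0, 1, 2)
x = np.array([[0., 0.], [0.2, 0.1], [1., 1.], [1.1, 0.9], [2., 2.], [2.1, 1.9]])
y = np.array([0, 0, 1, 1, 2, 2])

h = LogisticRegression(multi_class='multinomial')
h.fit(x, y)
print(h.score(x, y))   # fraction of correct predictions, between 0 and 1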
The MNIST dataset of handwritten digits is often used as a toy problem in machine learning courses because it is small and nontrivial. You played with this dataset in homework 4, whether you realized that or not.
However, the MNIST dataset is a very "easy" classification problem, in that it is not hard to achieve very high accuracies on both training and test data with very simple classifiers.
Instead, this part of the assignment uses the Vowel Dataset, which is described here. This dataset poses a hard classification problem in two ways: it is difficult to obtain decent results with simple classifiers, and the dataset is rather small, making it impossible to use classifiers with very large numbers of parameters. Using a difficult dataset in this way (i) gives you a more realistic picture of what machine learning can be about and (ii) is an example of a case where simple classifiers are really the only option, whether they work well or not. You should not expect very high accuracies in this part.
The vowel dataset was constructed as follows (you can read more in the document linked to above).
Researchers in phonetics made a list of eleven words each of which contains one of eleven distinct vowel sounds in British English. Here are the words (OK, one of them isn't actually a word):
heed, hid, head, had, hard, hud, hod, hoard, hood, who'd, heard
They then asked fifteen speakers to speak these words, once per word per speaker. Eight speakers were used to build the training set and seven speakers, different from the first eight, were used to build the test set. The researchers recorded the words and extracted six short audio clips from the vowel part of each word. The total number of clips is therefore as follows:
- training set: $8 \times 11 \times 6 = 528$ clips;
- test set: $7 \times 11 \times 6 = 462$ clips.
Each clip was then processed with some fancy signal processing that resulted in ten floating point numbers that somehow describe the main phonetic characteristics of the clip. We do not need to understand what these characteristics are: This is the advantage of converting a complicated input into a fixed-length vector, remember?
Suffice it to say that, in our customary notation, a sample $(\mathbf{x}, y)$ (either training or test) has $\mathbf{x}\in X\subseteq\mathbb{R}^{10}$ (ten numbers per clip) and $y\in Y = \{0, 1, 2, \ldots, 10\}$ (one of eleven vowel IDs).
As usual, let us download some data and code:
from urllib.request import urlretrieve
from os import path as osp
def retrieve(file_name, semester='fall22', course='371', homework=5):
    if osp.exists(file_name):
        print('Using previously downloaded file {}'.format(file_name))
    else:
        fmt = 'https://www2.cs.duke.edu/courses/{}/compsci{}/homework/{}/{}'
        url = fmt.format(semester, course, homework, file_name)
        urlretrieve(url, file_name)
        print('Downloaded file {}'.format(file_name))
retrieve('vowel.train')
retrieve('vowel.test')
retrieve('vowels.py')
Using previously downloaded file vowel.train
Using previously downloaded file vowel.test
Using previously downloaded file vowels.py
from vowels import vowel_dataset
import numpy as np
dataset = vowel_dataset()
The dataset is a dictionary with keys train, test, and label meaning.

train and test are the training set and the test set, each represented as a dictionary with keys x and y, with the data in the format explained earlier.

m = dataset['label meaning'] is a list of the eleven words used in the experiment. Entry m[0] corresponds to label 0, entry m[1] to label 1, and so forth.

Write a function with header
def train_and_test(data, split='original'):
that takes a dataset like dataset, instantiates a LogisticRegression classifier, and fits it to the training set. Then the function prints the accuracy measured on both the training set and the test set. Each accuracy value should be a percentage, that is, a number between 0 and 100, displayed with two decimals after the period. It should be printed on a separate line that explains which accuracy is being printed. For instance,

training accuracy on the original split is xx.xx percent

The word original in this line is not hard-wired. Instead, it is taken from the value of the optional parameter split. Don't worry what this means for now. You will use this function also in a later problem, where the data will be split in a different way. The split in dataset is the "original" split, so for now you will use the default value of split.

Finally, the function returns the (trained) classifier.
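For the output format, a print statement along these lines would match the spec (here acc stands for a hypothetical accuracy fraction returned by score):

print('training accuracy on the {} split is {:.2f} percent'.format(split, 100 * acc))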
Show your code and what it prints. Call your function as

classifier = train_and_test(dataset, 'original')

so that classifier can be used later on as well.
Does the classifier generalize well? Explain your answer very briefly.
One way to get some insight about the performance of a multi-class classifier is to see whether some classes are harder to recognize than others. Write a function with header
def print_difficult_vowels(h, data):
that takes a classifier and a dataset like dataset, and for each of the training set and the test set computes (without printing it) the accuracy for each vowel: What fraction of the samples whose true label is the first vowel does the classifier get right? And for the second vowel? And so forth.
For each of the two sets (training and test) the function prints which vowel has the lowest accuracy and the value of that accuracy. So the function prints out two lines with the following format:
'lowest {} accuracy is {:.2f} percent on the vowel in "{}" '
where the placeholders specify which set we are talking about (training or test), the accuracy (again, as a percentage) and the word that contains that vowel.
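One way to compute the per-vowel accuracies is sketched below; this is our own take, not necessarily the intended solution (predict is the standard sklearn prediction method):

import numpy as np

def per_class_accuracy(h, x, y, k=11):
    # Fraction of the samples of each true class c that the classifier labels c
    y_hat = h.predict(x)
    return np.array([np.mean(y_hat[y == c] == c) for c in range(k)])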
Show your code and what it prints out for classifier and dataset as defined earlier.
The vowel dataset is very unusual in the landscape of machine learning datasets because the test set is radically different from the training set: The speakers in the two sets are different.
In contrast, most datasets for machine learning research split a single dataset into training and test set at random. Let us see how that works in this problem.
Write a function with header
def random_split(data):
that takes a dataset like dataset, merges training and test data, and splits the merged data again into training and test sets uniformly at random, with each set containing about half of the samples (n // 2 in the training set and the rest in the test set). The function should return a dictionary of the same format as dataset. Don't forget to copy the list of label meanings.
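The splitting mechanics might look something like this sketch (names and structure are ours; adapt it to the dictionary format described above):

import numpy as np

def half_split(x, y, rng=None):
    # Shuffle the indices, then put the first n // 2 samples in one set
    if rng is None:
        rng = np.random.default_rng()
    n = len(y)
    perm = rng.permutation(n)
    first, second = perm[:n // 2], perm[n // 2:]
    return (x[first], y[first]), (x[second], y[second])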
Show your code and the printout resulting from running train_and_test on the new split that your function generates from dataset.
Results for the previous experiment may vary from run to run since the data split is random. You may want to run the experiment a few times to get a sense of this variability. However, do not show those multiple runs.
Compare the results from Problems 3.1 and 3.4 and comment on the following aspects: