When you submit your work, follow the instructions on the submission workflow page scrupulously for full credit.
Instead of handing out sample exams in preparation for the midterm and final, homework assignments include several problems that are marked Exam Style. These problems are similar in format and content to those in the exams, and are graded like the other problems. Please use them as examples of what may come up in an exam.
As a further exercise in preparation for exams, you may want to think of other plausible problems with a similar style. If you collect these problems in a separate notebook as the course progresses, you will have a useful sample you can use to rehearse for the exams.
Is this assignment properly formatted in the PDF submission, and did you tell Gradescope accurately where each answer is?
This question requires no explicit answer from you, and no space is provided for an answer in the template file.
However, the rubric entry for this problem in Gradescope will be used to deduct points for poor formatting of the assignment (PDF is not readable, code or text is clipped, math is not typeset, ...) and for incorrect indications in Gradescope of the location of each answer.
Sorry we have to do this, but poor formatting and bad answer links in Gradescope make our grading much harder.
By "convex" we mean "convex everywhere" in this part of the assignment.
Define two functions as follows:
- a weakly convex function $f : \mathbb{R}^d \rightarrow \mathbb{R}$;
- a weakly convex and nondecreasing function $g : \mathbb{R} \rightarrow \mathbb{R}$.
Define the new function $h : \mathbb{R}^d \rightarrow \mathbb{R}$ as the composition of $f$ and $g$:
$$ z = h(\mathbf{x}) = g(y) \;\;\;\text{where}\;\;\; y = f(\mathbf{x})\;. $$
We also use the affine function
$$ \phi(\mathbf{x}) = b + \mathbf{w}^T \mathbf{x} $$
from $\mathbb{R}^d$ to $\mathbb{R}$ in this part of the assignment.
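For a concrete instance of the composition $h$ above (our example, not part of the problem): with $f(\mathbf{x}) = \|\mathbf{x}\|$ and $g(y) = e^y$, both weakly convex and with $g$ nondecreasing, we get
$$ h(\mathbf{x}) = g(f(\mathbf{x})) = e^{\|\mathbf{x}\|}\;. $$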
Is the function $\phi(\mathbf{x})$ given above weakly convex? Prove your answer.
Is the function $\phi$ defined in the preamble above strongly convex? Weakly concave? Just answer each question. No proof required.
Give a simple example that shows that if we remove the requirement that $g$ is nondecreasing then the function $h$ constructed in the preamble above is not necessarily even weakly convex.
Start with the simplest weakly convex function and the simplest strongly convex function you can think of. Can you see ways to tweak and compose these two functions that will make the composition nonconvex (even concave)?
Prove that the function $h$ constructed in the preamble above is weakly convex without assuming that either $f$ or $g$ is differentiable.
In this part we explore basic properties of the combination of score function and loss function for the binary Logistic-Regression Classifier (LRC) with label set $Y = \{0, 1\}$.
The data score function is
$$ s(\mathbf{x}) = f(a(\mathbf{x})) $$
where
$$ a(\mathbf{x}) = b + \mathbf{w}^T \mathbf{x} $$
is the activation function and $f$ is some function of the activation. In these expressions, the data point $\mathbf{x}$ is a vector in $\mathbb{R}^d$ for some fixed positive integer $d$.
The classifier is then
$$ h(\mathbf{x}) = \left\{\begin{array}{ll} 1 & \mbox{if $s(\mathbf{x}) > 1/2$} \\ 0 & \mbox{otherwise.} \end{array}\right. $$
This function depends on the parameters $\mathbf{v} = (b, \mathbf{w})$ of the activation function $a(\mathbf{x})$.
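The following sketch shows this construction in code, with a logistic activation score function picked here purely for illustration (the choices of $f$ actually used in this part come later):

import numpy as np

def make_classifier(f, b, w):
    def s(x):                          # data score s(x) = f(a(x))
        return f(b + np.dot(w, x))     # activation a(x) = b + w^T x
    def h(x):                          # classifier: threshold the score at 1/2
        return 1 if s(x) > 0.5 else 0
    return s, h

f = lambda alpha: 1 / (1 + np.exp(-alpha))    # logistic function, for illustration
s, h = make_classifier(f, b=0.1, w=np.array([1.0, -2.0]))
x = np.array([0.5, 0.2])
print(s(x), h(x))                      # score in (0, 1) and the predicted label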
In the literature and in the class notes, the term "score function" is sometimes used to denote two different functions:
- the function $f$, which maps the activation to the score;
- the function $s$, which maps the data point directly to the score.
The same name is used for these two functions because they return the same value, namely the score for $\mathbf{x}$. However, the signatures of the two functions are different. The function $f$ takes the activation value $\alpha = a(\mathbf{x})$ (a real number) as input while the function $s$ directly takes the data point $\mathbf{x}$ (a vector of $d$ numbers) as input.
To avoid confusion between the two functions in this assignment we use the term data score function for $s$ and activation score function for $f$. Thus, the data score function is the composition of the activation function with the activation score function.
The classifier $h$ is trained on the training set
$$ T = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\} $$
by solving the following optimization problem:
$$ \hat{b}, \hat{\mathbf{w}} \in \arg\min_{b, \mathbf{w}} L_T(b, \mathbf{w}) $$
where the risk $L_T$ is defined as follows:
$$ L_T(b, \mathbf{w}) = \frac{1}{N} \sum_{n=1}^N \ell(y_n, s(\mathbf{x}_n))\;. $$
In the last expression, $\ell(y, p)$ is called the score loss function when viewed as a function of true label $y$ and data score $p = s(\mathbf{x})$.
For clarity, we call the composition
$$ \ell(y, s(\mathbf{x})) $$
the sample loss function in this assignment. Thus, the score loss function is a function of two real-valued variables while the sample loss function is defined on $Y\times\mathbb{R}^d$.
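In code, the risk translates directly from the formula above; in this sketch, ell and s are placeholders for a score loss function and a data score function:

def risk(ell, s, T):
    # Empirical risk: average sample loss over T = [(x_1, y_1), ..., (x_N, y_N)]
    return sum(ell(y, s(x)) for x, y in T) / len(T)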
In this part we consider two choices for a possible activation score function $f$:
- the identity function $f_1(\alpha) = \alpha$;
- the logistic function $f_2(\alpha) = \dfrac{1}{1 + e^{-\alpha}}$.
We saw in class that the identity score function is not a great idea. See Figure 2 of the notes on linear predictors, part 1. In that figure, $d = 1$ (that is, the data space is one-dimensional) and the data score $s(x)$ is an affine function of $x$, that is, $s(x) = b + wx$. In the notation used in this assignment, the activation is $\alpha = a(x) = b + wx$, and therefore the activation score function is $f(\alpha) = \alpha$, the identity function.
We then preferred the logistic function for a variety of reasons. This part explores convexity, or lack thereof, of the loss/score combination as one of those reasons.
We also consider two choices for a possible score loss function $\ell(y, p)$, where $p = s(\mathbf{x})$ is the score for data point $\mathbf{x}$:
- the quadratic loss $\ell_1(y, p) = (p - y)^2$;
- the cross-entropy loss
$$ \ell_2(y, p) = \left\{\begin{array}{ll} -\log p & \mbox{if $y = 1$} \\ -\log(1-p) & \mbox{if $y = 0$} \end{array}\right. $$
which can be written more compactly as follows:
$$ \ell_2(y, p) = -y\log p - (1 - y) \log(1-p) \;. $$
We use natural logarithms in this assignment.
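The compact form translates directly to code; here is a sketch (np.log is the natural logarithm, as required):

import numpy as np

def ell2(y, p):
    # Cross-entropy: reduces to -log(p) when y = 1 and -log(1 - p) when y = 0
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

print(ell2(1, 0.9), ell2(0, 0.9))  # small loss when confident and correct, large otherwise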
For each of the functions $f_1$, $f_2$, $\ell_1$, $\ell_2$ defined above state whether the function is convex everywhere, concave everywhere, or neither; and whether it is increasing everywhere, decreasing everywhere, or neither.
For $f_1$ and $f_2$ it is obvious what this means, as these functions depend on a single scalar variable $\alpha$. For $\ell_1$ and $\ell_2$ consider these as functions of $p$. Considering these as functions of $y$ would make little sense, since $y$ is a label, and therefore a categorical value. On the other hand, $p$ is a score value, which is a real number.
It is possible for a score loss function to have one property for $y=0$ and another property for $y=1$, so you should give an answer for each value of $y$ in that case.
Be as specific as you can in your answers. For instance, if a function is strongly convex it is also weakly convex, but we want "strongly convex" as an answer, since this is more specific than "weakly convex." Similarly for "decreasing" (more specific) versus "non-increasing" (less specific) and the like. If a function is both weakly convex and weakly concave, so state.
No proofs are needed, and you may leave the word "everywhere" implicit in your answers.
Given the choices above, there are up to four possible combinations of activation score function and score loss function. For each combination give a single formula for the corresponding sample loss function in terms of $y$ and $\alpha$. To keep the expressions simpler, do not expand $\alpha$ to $b + \mathbf{w}^T\mathbf{x}$.
If a combination makes no sense mathematically (in that the two functions cannot be combined by composition), write "undefined" rather than the formula. Each of your formulas should be written in the form
$$ \ell_i(y, f_j(\alpha)) = \cdots $$
where $i\in\{1, 2\}$ and $j\in\{1, 2\}$.
Make your formulas as simple as you can notationally.
In addition, for each of the combinations that are not undefined, state if the combined function is globally convex in $\alpha$ or not. No need to prove your statement, but make sure you know why you give the answers you give.
Plot the sample loss functions from problem 2.2 that are not undefined, each in its own notebook cell. The independent variable is $\alpha$ in $[-5, 5]$. Each plot should have a curve for $y=0$ and one for $y=1$.
Put a legend that states which curve is which ($y=0$ or $y=1$) and a title that states which sample loss function you are plotting. Axis labels are optional.
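A sketch of the plotting mechanics follows; the stand-in loss here is arbitrary, and you should substitute your own combinations from problem 2.2:

import numpy as np
import matplotlib.pyplot as plt

alpha = np.linspace(-5, 5, 201)
loss = lambda y, a: np.abs(y - np.tanh(a))   # arbitrary stand-in, for illustration only

plt.figure()
for y in (0, 1):
    plt.plot(alpha, loss(y, alpha), label='y = {}'.format(y))
plt.legend()
plt.title('stand-in loss (replace with your own)')
plt.show()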
Consider the plots in the previous problem for $y = 0$. If correct, these plots should suggest that the sample loss functions they depict behave rather differently as $\alpha\rightarrow\infty$. All these functions asymptote to polynomials; that is, they can be approximated by $\alpha^k$ for some $k \geq 0$ as $\alpha\rightarrow\infty$. Give the value of $k$ for each of the combinations that are not undefined in problem 2.2.
Specify which value of $k$ is for which sample loss function by giving your answers in the form
$$ \ell_i(y, f_j(\alpha)) \rightarrow \alpha^k $$
for appropriate values of $i, j, k$. Then state which function is most or least sensitive to values that are classified correctly and are very far from the decision boundary. Assume that the form of the activation $a(\mathbf{x})$ is the same for all samples. No explanation needed.
One of the functions in problem 2.2 is sensitive to samples far from the boundary regardless of whether they are correctly or incorrectly classified. Which one, and how do you know?
The class sklearn.linear_model.LogisticRegression implements a Multiclass Logistic-Regression Classifier (MLRC). Please read the documentation of this class online. We will use it with default values for all parameters, with the lone exception of multi_class='multinomial'. So you instantiate the classifier with the call

h = LogisticRegression(multi_class='multinomial')

and then train it on training set x, y with

h.fit(x, y)
If the training set has $n$ samples with $d$ features each, and the labels take $k$ distinct values, then x is an $n$ by $d$ numpy array of floats and y is a numpy vector with $n$ integers between $0$ and $k-1$.
Of the various attributes of LogisticRegression you will only need score, so please read the documentation for that.

Important: The attribute score is not the same as the score function we talked about in class. That would be the attribute predict_proba, which we however will not need for this assignment. Instead, score takes a dataset and returns the classifier's accuracy on that dataset. The accuracy is the fraction of samples in the given dataset that the classifier predicts correctly. It is returned as a number between 0 and 1.
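As a minimal sketch of these calls (the toy arrays below are made up for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: 6 samples, 2 features, 3 classes (labels 0, 1, 2)
x = np.array([[0., 0.], [0.2, 0.1], [1., 1.], [1.1, 0.9], [2., 2.], [2.1, 1.9]])
y = np.array([0, 0, 1, 1, 2, 2])

h = LogisticRegression(multi_class='multinomial')
h.fit(x, y)
print(h.score(x, y))   # fraction of correct predictions, between 0 and 1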
The MNIST dataset of handwritten digits is often used as a toy problem in machine learning courses because it is small and nontrivial. You played with this dataset in homework 4, whether you realized that or not.
However, the MNIST dataset is a very "easy" classification problem, in that it is not hard to achieve very high accuracies on both training and test data with very simple classifiers.
Instead, this part of the assignment uses the Vowel Dataset, which is described here. This dataset poses a hard classification problem in two ways: it is difficult to obtain decent results with simple classifiers, and the dataset is rather small, making it impossible to use classifiers with very large numbers of parameters. Using a difficult dataset in this way (i) gives you a more realistic picture of what machine learning can be about and (ii) is an example of a case where simple classifiers are really the only option, whether they work well or not. You should not expect very high accuracies in this part.
The vowel dataset was constructed as follows (you can read more in the document linked to above).
Researchers in phonetics made a list of eleven words each of which contains one of eleven distinct vowel sounds in British English. Here are the words (OK, one of them isn't actually a word):
heed, hid, head, had, hard, hud, hod, hoard, hood, who'd, heard
They then asked fifteen speakers to speak these words, once per word per speaker. Eight speakers were used to build the training set and seven speakers, different from the first eight, were used to build the test set. The researchers recorded the words and extracted six short audio clips from the vowel part of each word. The total number of clips is therefore as follows:
- training set: $8 \times 11 \times 6 = 528$ clips;
- test set: $7 \times 11 \times 6 = 462$ clips.
Each clip was then processed with some fancy signal processing that resulted in ten floating point numbers that somehow describe the main phonetic characteristics of the clip. We do not need to understand what these characteristics are: This is the advantage of converting a complicated input into a fixed-length vector, remember?
Suffice it to say that, in our customary notation, a sample $(\mathbf{x}, y)$ (either training or test) has $\mathbf{x}\in X\subseteq\mathbb{R}^{10}$ (ten numbers per clip) and $y\in Y = \{0, 1, 2, \ldots, 10\}$ (one of eleven vowel IDs).
As usual, let us download some data and code:
from urllib.request import urlretrieve
from os import path as osp
def retrieve(file_name, semester='fall22', course='371', homework=5):
    if osp.exists(file_name):
        print('Using previously downloaded file {}'.format(file_name))
    else:
        fmt = 'https://www2.cs.duke.edu/courses/{}/compsci{}/homework/{}/{}'
        url = fmt.format(semester, course, homework, file_name)
        urlretrieve(url, file_name)
        print('Downloaded file {}'.format(file_name))
retrieve('vowel.train')
retrieve('vowel.test')
retrieve('vowels.py')
Using previously downloaded file vowel.train
Using previously downloaded file vowel.test
Using previously downloaded file vowels.py
from vowels import vowel_dataset
import numpy as np
dataset = vowel_dataset()
The dataset is a dictionary with keys train, test, and label meaning.

train and test are the training set and the test set, each represented as a dictionary with keys x and y, with the data in the format explained earlier.

m = dataset['label meaning'] is a list of the eleven words used in the experiment. Entry m[0] corresponds to label 0, entry m[1] to label 1, and so forth.

Write a function with header
def train_and_test(data, split='original'):
that takes a dataset like dataset, instantiates a LogisticRegression classifier, and fits it to the training set. Then the function prints the accuracy measured on both the training set and the test set. Each accuracy value should be a percentage, that is, a number between 0 and 100, displayed with two decimals after the period. It should be printed on a separate line that explains which accuracy is being printed. For instance,

training accuracy on the original split is xx.xx percent

The word original in this line is not hard-wired. Instead, it is taken from the value of the optional parameter split. Don't worry what this means for now. You will use this function also in a later problem, where the data will be split in a different way. The split in dataset is the "original" split, so for now you will use the default value of split.

Finally, the function returns the (trained) classifier.
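For the output format, a print statement along these lines would match the spec (here acc stands for a hypothetical accuracy fraction returned by score):

print('training accuracy on the {} split is {:.2f} percent'.format(split, 100 * acc))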
Show your code and what it prints. Call your function as

classifier = train_and_test(dataset, 'original')

so that classifier can be used later on as well.
Does the classifier generalize well? Explain your answer very briefly.
One way to get some insight about the performance of a multi-class classifier is to see whether some classes are harder to recognize than others. Write a function with header
def print_difficult_vowels(h, data):
that takes a classifier and a dataset like dataset, and for each of the training set and the test set computes (without printing it) the accuracy for each vowel: What fraction of the samples whose true label is the first vowel does the classifier get right? And for the second vowel? And so forth.
For each of the two sets (training and test) the function prints which vowel has the lowest accuracy and the value of that accuracy. So the function prints out two lines with the following format:
'lowest {} accuracy is {:.2f} percent on the vowel in "{}" '
where the placeholders specify which set we are talking about (training or test), the accuracy (again, as a percentage) and the word that contains that vowel.
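One way to compute the per-vowel accuracies is sketched below; this is our own take, not necessarily the intended solution (predict is the standard sklearn prediction method):

import numpy as np

def per_class_accuracy(h, x, y, k=11):
    # Fraction of the samples of each true class c that the classifier labels c
    y_hat = h.predict(x)
    return np.array([np.mean(y_hat[y == c] == c) for c in range(k)])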
Show your code and what it prints out for classifier and dataset as defined earlier.
The vowel dataset is very unusual in the landscape of machine learning datasets because the test set is radically different from the training set: The speakers in the two sets are different.
In contrast, most datasets for machine learning research split a single dataset into training and test set at random. Let us see how that works in this problem.
Write a function with header
def random_split(data):
that takes a dataset like dataset, merges training and test data, and splits the merged data again into training and test sets uniformly at random, with each set containing about half of the samples (n // 2 in the training set and the rest in the test set). The function should return a dictionary of the same format as dataset. Don't forget to copy the list of label meanings.
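The splitting mechanics might look something like this sketch (names and structure are ours; adapt it to the dictionary format described above):

import numpy as np

def half_split(x, y, rng=None):
    # Shuffle the indices, then put the first n // 2 samples in one set
    if rng is None:
        rng = np.random.default_rng()
    n = len(y)
    perm = rng.permutation(n)
    first, second = perm[:n // 2], perm[n // 2:]
    return (x[first], y[first]), (x[second], y[second])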
Show your code and the printout resulting from running train_and_test on the new split that your function generates from dataset.
Results for the previous experiment may vary from run to run since the data split is random. You may want to run the experiment a few times to get a sense of this variability. However, do not show those multiple runs.
Compare the results from Problems 3.1 and 3.4 and comment on the following aspects: