COMPSCI 371 Homework 1¶

Homework Submission Workflow¶

When you submit your work, follow the instructions on the submission workflow page scrupulously for full credit.

Note on Exam-Style Problems¶

Instead of handing out sample exams in preparation for the midterm and final, homework assignments include several problems that are marked Exam Style . These are problems whose format and contents are similar to those in the exams, and are graded like the other problems. Please use these problems as examples of what may come up in an exam.

As a further exercise in preparation for exams, you may want to think of other plausible problems with a similar style. If you collect these problems in a separate notebook as the course progresses, you will have a useful sample you can use to rehearse for the exams.

Problem 0 (3 points)¶

Is this assignment properly formatted in the PDF submission, and did you tell Gradescope accurately where each answer is?

Note¶

This question requires no explicit answer from you, and no space is provided for an answer in the template file.

However, the rubric entry for this problem in Gradescope will be used to deduct points for poor formatting of the assignment (PDF is not readable, code or text is clipped, math is not typeset, ...) and for incorrect indications in Gradescope for the location of each answer.

Sorry we have to do this, but poor formatting and bad answer links in Gradescope make our grading much harder.

Part 1: Polynomial Fitting¶

Warning¶

The problems in this part are linked, in that the results for one problem depend on the results for previous problems being correct. You will be penalized for mistakes even if the answer is wrong only because previous answers are wrong. This is consistent with the real world: All modules of a program must be correct for the program to work, and it is good practice to test each module thoroughly for correctness before moving on to other modules. If any module fails, the whole program fails, and real life does not give partial credit for software that almost works but doesn't.

Monomials¶

The function monomials defined below returns the specification of a multivariate polynomial in $d$ scalar real variables $x_0, \ldots, x_{d-1}$ of degree up to $k$. Specifically, the function returns a numpy array with $d$ columns. Row $j$ of this array contains the powers $k_0^{j}, \ldots, k_{d-1}^{j}$ of the variables $x_0, \ldots, x_{d-1}$ in the $j$-th monomial of the polynomial.

The notation is the same as in the class notes on multivariate fitting, except that variables are numbered starting at zero rather than one to be consistent with Python usage. Monomials specified by the rows of the output array are in order of increasing degree. The ordering within each degree is irrelevant for this assignment.

In [1]:
import numpy as np
from itertools import combinations_with_replacement as combos


def monomials(d, k):
    c = np.array(list(combos(range(d + 1), k)))
    m = np.zeros((c.shape[0], d))
    for j in range(d):
        m[:, j] = np.sum(c == j, axis=1)
    order = np.argsort(np.sum(m, axis=1))
    return m[order]

Problem 1.1 (Exam Style Except for the Code)¶

Give a formula in terms of $d$ and $k$ for the number $r$ of rows in the array returned by monomials. Then write code that outputs a table of $r$ for all combinations of $d = 0,\ldots,15$ and $k = 0,\ldots, 8$.

It is not necessary to label rows and columns of the table. However, make sure that there is one row for each value of $d$ and one column for each value of $k$, and that columns line up. An easy way to achieve this is to make the table into a numpy array and just print that.

Feel free to use the function math.comb if useful.

Problem 1.2¶

Suppose that you are given a numpy vector x with $d$ entries representing the values of $d$ independent variables and the array p = monomials(d, k) where k is some degree.

Write a function with header

def evaluate_monomials(p, x):

that returns a numpy vector y with the values resulting from evaluating each of the monomials specified by p at the values in x.

Show your code and the result of the call

print(evaluate_monomials(monomials(d, k), x))

where x, d, k are defined as follows

In [2]:
x = np.array([0.5, -0.2, 0.1])
d, k = len(x), 2

Programming Note¶

The shapes of the arguments p and x to evaluate_monomials must be consistent with each other, so it is safe programming practice to start the body of the function with assert statements that check that they are. These statements are not required for full credit on this problem, but they are strongly recommended, to prevent mysterious code failures later on.

Problem 1.3 (Exam Style if Table is Provided)¶

Let $\mathbf{x}$ be the vector x defined in the previous problem and let $\mathbf{c}$ be the following vector of coefficients

In [3]:
c = np.linspace(2, 0, 35)

What is $k$ for this problem? You may want to use the table you made in an earlier problem.

Problem 1.4¶

Write a function with header

def evaluate_polynomial(c, p, x):

that uses evaluate_monomials to evaluate the polynomial with coefficients c at the vector of independent variables x if p is the appropriate output from monomials for the given c and x.

Then call evaluate_polynomial on the vectors c and x defined in the previous two problems and with the output from the appropriate call to monomials replaced for p. The order of the coefficients in c is assumed to correspond to the order of the monomials returned by monomials.

Show your code and the resulting numerical value y of the polynomial.

Problem 1.5¶

Write a function with header

def fit(X, y, k):

that uses monomials and evaluate_monomials to fit a multivariate polynomial of degree up to k to the data in X and y.

If there are $n$ data points, X is a two-dimensional numpy array with $n$ rows, and row X[j] contains the values of the independent variables for the $j$-th data sample. The entry y[j] in the numpy vector y is the true value corresponding to the vector X[j].

To show that your code works at least on some basic inputs, fit a straight line, that is, an affine function $y = a + bx$, to the following noisy training data x_train, y_train:

In [4]:
n_train = 100
a, b = 2., 3.
sigma = 0.2
x_train = np.sort(np.random.rand(n_train))
noise = sigma * np.random.randn(n_train)
y_train = a + b * x_train + noise
x_train = np.expand_dims(x_train, axis=1)

Here, x_train is X and y_train is y.

After running fit on this data (with k equal to 1 so you get an affine function), do the following:

  • Print the coefficients $a$ and $b$ of the resulting line $y = a + bx$, in this order. Remember that monomials outputs monomials in order of increasing degree.
  • Say if the coefficients are close to what you would expect.
  • Print the value of the residual error of the solution, that is, the square root of the risk on the training set using the quadratic loss.
  • Plot the points in the training data as dots and overlay the solution line on your plot. No need to label the axes.

For full credit, your code should comply with the programming notes below.

Programming Notes¶

  • For clean programming it is best to break up the definition of fit and introduce an auxiliary function with header

      def fitting_matrix(p, X):
    
    

    that computes the matrix of the linear system to be solved.

  • To solve a linear system in the least squares sense use the function numpy.linalg.lstsq. Look it up to see how it works.
  • You will need to compute residuals also in later problems, so it's best to be general right away: Write a function with header

      def predict(c, k, X):
    
    

    that uses monomials and fitting_matrix to evaluate the polynomial with coefficients c and degree up to k on the data X and a function with header

      def residual_error(c, k, X, y):
    
    

    that uses predict to compute the residual error.

Problem 1.6¶

Let us now explore generalization using a rather weird problem as a sandbox. Specifically, we want to see if we can learn a polynomial predictor for the larger eigenvalue of a symmetric $2\times 2$ real matrix. Since the matrix is real and symmetric it has two (possibly coincident) real eigenvalues, and we are interested in the larger of the two, let us call that $\lambda_{\max}$.

Of course, we don't need machine learning for this, as the function numpy.linalg.eig computes eigenvalues. Alternatively, since the matrix is $2\times 2$, it is easy to write a formula for $\lambda_{\max}$ in terms of the entries of the matrix: If the matrix is

$$ A = \left[\begin{array}{cc} a & b \\ b & c \end{array} \right] $$

then

$$ \lambda_{\max} = \frac{a + c + \sqrt{(a - c)^2 + 4b^2}}{2} $$

where the formula is written so as to make it obvious that the result is real.

However, the point here is generalization, not eigenvalues, so let's move on. The idea is to encode $A$ by the data vector $\mathbf{x} = (a, b, c)$ and the corresponding $\lambda_{\max}$ is $y$.

The following function makes a training set with $n$ samples in the form of an $n\times 3$ matrix $X$ of $\mathbf{x}$ vectors and a vector $\mathbf{y}$ with the corresponding $n$ true eigenvalues $y$ computed (to within numerical accuracy) with numpy.linalg.eig.

In [5]:
def eigen_data(n):
    X = np.random.randn(n, 3)
    matrices = np.reshape(X[:, [0, 1, 1, 2]], (n, 2, 2))
    y = np.array([np.max(np.linalg.eig(matrix)[0]) for matrix in matrices])
    return X, y

Rather than just a training set $T$ we now also make a test set $S$. Let's make them equal in size, although that is generally not the case in machine learning.

In [6]:
n_train, n_test = 100, 100
x_train, y_train = eigen_data(n_train)
x_test, y_test = eigen_data(n_test)

Write code that makes a single diagram with two plots. One plot is for the residual error (as computed with residual_error defined earlier) on the training data, the other on the test data. The two plots show the values of these errors when the data is fit with polynomials of degree $ k \in \{0,\ldots, k_{\max}\}$ where $k_{\max} = 10$. The value of $k$ is on the abscissa of the diagram.

Since the values of the residual error range over a large interval, use the function matplotlib.pyplot.semilogy rather than matplotlib.pyplot.plot, to make it easier to see all the values.

Your diagram should have a legend that clarifies which plot is which. Label the axes meaningfully.

Problem 1.7 (Exam Style)¶

Answer the following questions and give a brief and clear explanation for each of your answers.

  1. Why can we not expect the predictor to do perfectly (zero error) on the test set (which were not used for training) even when a large polynomial degree and an infinitely large training set were used, and even with perfect arithmetic? This is a question about eigenvalues and perhaps calculus, not about machine learning.
  2. The predictor performs nearly exactly (that is, exactly to within numerical accuracy) on the training set starting with some polynomial degree. What is that degree, and why is the training risk essentially zero from then on? In particular, what is so special about that degree? Relate your answer to previous problems.
  3. If you had to use one of these predictors on previously unseen data, what degree would you use and why?
  4. Using the degree in your last answer, is performance on the test set good? Be quantitative in your answer. This also requires you to figure out what "good performance" might mean.

Part 2: Points in Many Dimensions¶

The function gaussian_sample below returns n points in d dimensions, drawn from an isotropic Gaussian distribution with zero mean. "Isotropic" means that the covariance is some multiple of the identity matrix, so the components of each vector in the sample are mutually uncorrelated and identically distributed.

Through some fancy mathematics that is irrelevant to this assignment (read this article if you are interested), the multiplier for the covariance matrix is chosen so that the statistical mean of the distances between pairs of points from the distribution is equal to one. Of course, this will hold only approximately for the empirical mean of a finite sample.

In [7]:
from math import gamma, sqrt

def gaussian_sample(n, d):
    mean = np.zeros(d)
    mean_distance = 2 * gamma((d + 1) / 2) / gamma(d / 2) if d < 200 else sqrt(2 * d)
    covariance = np.eye(d) / pow(mean_distance, 2)
    return np.random.multivariate_normal(mean, covariance, size=n)

Given a numpy vector x of real-valued scalars and an integer d, the function show_histogram below plots a histogram of x and gives the plot a title that reports the value of d, interpreted as the dimensionality of some set, and the mean of the values in x.

In [8]:
def show_histogram(x, d):
    density, _, _ = plt.hist(x=x, bins='auto', density=True, rwidth=0.9)
    mean = np.mean(x)
    plt.grid(axis='y', alpha=0.75)
    plt.xlabel('Distance')
    plt.ylabel('Empirical Density')
    format_string = '{} dimension{}. Mean distance {:.2f}'
    plt.title(format_string.format(d, 's' if d > 1 else '', mean))

Problem 2.1¶

Write a function with header

def distances(points):

that takes a numpy array of points as returned by gaussian_sample and computes the Euclidean distances between all pairs of points in the array. The (zero) distances between each point and itself should not be included. However, if the distance between two distinct points happens to be zero (an event of probability zero), the distance should be included.

Show your code and the histogram obtained for a sample of 300 points on the plane (that is, $d=2$) generated by gaussian_sample.

Using nested for loops on al pairs of points is OK.

Problem 2.2¶

Show three plots side to side (use plt.subplot), each showing the same histogram as in the previous Problem but in 10, 100, and 1000 dimensions.

Problem 2.3¶

Describe in a single sentence how the four plots in the previous two problems differ from each other qualitatively. Pay attention to the different abscissas.

Problem 2.4¶

A nearest-neighbor classifier is a classifier we will study in class in some detail. At training time, it merely memorizes all the samples $(\mathbf{x}_n, y_n)$ in the training set $T$.

When the classifier is deployed on a new data point $\mathbf{x}$, it searches $T$ for the sample $(\mathbf{x}_{\hat{n}}, y_{\hat{n}})$ for which the distance between $\mathbf{x}$ and $\mathbf{x}_{\hat{n}}$ is smallest (that is, it searches for the nearest neighbor to $\mathbf{x}$ in $T$). It then returns the corresponding label $y_{\hat{n}}$ as its estimate for the label $\hat{y}$ for $\mathbf{x}$.

This is a form of rote learning : Just memorize past questions and answers. When confronted with a new question, give the answer for the most similar question you have seen in the past.

The nearest-neighbor classifier works really well in low-dimensional spaces, but its performance degrades as the dimensionality of the domain $X$ of the classifier increases.

Assuming that the Euclidean metric is used to determine distances, how can you explain this degradation of performance in light of the plots you made in the previous two problems?