COMPSCI 371 Homework 4¶

Homework Submission Workflow¶

When you submit your work, follow the instructions on the submission workflow page scrupulously for full credit.

Note on Exam-Style Problems¶

Instead of handing out sample exams in preparation for the midterm and final, homework assignments include several problems that are marked Exam Style. These are problems whose format and content are similar to those in the exams, and they are graded like the other problems. Please use these problems as examples of what may come up in an exam.

As a further exercise in preparation for exams, you may want to think of other plausible problems with a similar style. If you collect these problems in a separate notebook as the course progresses, you will have a useful sample you can use to rehearse for the exams.

Problem 0 (3 points)¶

Is this assignment properly formatted in the PDF submission, and did you tell Gradescope accurately where each answer is?

Note¶

This question requires no explicit answer from you, and no space is provided for an answer in the template file.

However, the rubric entry for this problem in Gradescope will be used to deduct points for poor formatting of the assignment (PDF is not readable, code or text is clipped, math is not typeset, ...) and for incorrect indications in Gradescope for the location of each answer.

Sorry we have to do this, but poor formatting and bad answer links in Gradescope make our grading much harder.

Software for this Assignment¶

Run the code in the following two cells before you start working on this assignment. It will retrieve two Python files that are necessary to do the work.

In [1]:
from urllib.request import urlretrieve
from os import path as osp


def retrieve(file_name, semester='fall22', course='371', homework=4):
    if osp.exists(file_name):
        print('Using previously downloaded file {}'.format(file_name))
    else:
        fmt = 'https://www2.cs.duke.edu/courses/{}/compsci{}/homework/{}/{}'
        url = fmt.format(semester, course, homework, file_name)
        urlretrieve(url, file_name)
        print('Downloaded file {}'.format(file_name))
In [2]:
for file_name in ('mlp.py', 'checks.py'):
    retrieve(file_name)
Using previously downloaded file mlp.py
Using previously downloaded file checks.py

Part 1: Line Search¶

We are looking for a local minimum of the function $\mathbb{R} \rightarrow \mathbb{R}$ defined as

$$ y = f(\alpha) = \sin\left(\frac{\pi \alpha + 1}{2}\right)\;, $$

starting at $\alpha_0 = 0$.

Since the function is univariate, we can just use line search.
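
For concreteness, here is one way to define this function in Python (a minimal sketch, assuming numpy is imported as np):

import numpy as np

def f(alpha):
    # y = sin((pi * alpha + 1) / 2)
    return np.sin((np.pi * alpha + 1) / 2)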

The bracketing triple at some point during line search happens to be

$$ (a, b, c) = (0, 2, 5) \;. $$

Note:¶

Some of the problems in this part ask you to calculate values of functions. You may use any calculator you want, and Python is an obvious choice. However, please do not include any code you may use to compute these values, except when you are explicitly asked to do so.

When reporting numerical values, please give four decimal digits after the period.

Problem 1.1¶

Write Python code to plot the function $f(\alpha)$ for $\alpha\in [-2, 6]$, place red dots at the points $(a, f(a))$, $(b, f(b))$, $(c, f(c))$, and print out the values $f(a)$, $f(b)$, $f(c)$ in the format specified above.

Show your code and your plot and printout. No need to label the axes. Specify which value is which in your printout.

Output Format

Image of the plot: a line plot with red dots as described above

f(a) = x.xxxx, f(b) = x.xxxx, f(c) = x.xxxx

Problem 1.2 (Exam Style)¶

Prove that the triple $(a, b, c)$ given above is indeed a valid bracketing triple. Please remember that there are conditions on both $\alpha$ and $f(\alpha)$.

Problem 1.3 (Exam Style)¶

Give the values of the bracketing triple $(a', b', c')$ that the line search algorithm discussed in the class notes computes immediately from the triple given above (no golden-ratio search). Explain your reasoning briefly using the notation given in the class notes.

Problem 1.4 (Exam Style)¶

If line search is continued from the values given above, will it eventually converge to the local minimum of $f$ that is closest to the starting point $\alpha_0 = 0$? Explain briefly. Either way, give an exact expression and approximate numerical value for the point $\alpha^{\ast}$ that line search will converge to (after infinitely many iterations), and the corresponding function value $f(\alpha^{\ast})$.

Part 2: Stochastic Gradient Descent¶

Stochastic Gradient Descent (SGD) is used to minimize functions $f(\mathbf{z})$ where $\mathbf{z}$ lives in a large space and $f$ is an average of many other functions:

$$ f(\mathbf{z}) = \frac{1}{N} \sum_{n=1}^N \phi_n(\mathbf{z}) \;. $$

A prime example of this type of function is the risk function defined for a machine learning problem. In this case, $\mathbf{z}$ is a vector that contains all the parameters in the ML predictor $h(\mathbf{x}; \mathbf{z})$. As discussed in class, the function $\phi_n$ is the loss that the predictor $h$ incurs on sample number $n$ in a training set $T$ with $N$ samples.

The code in the file mlp.py you downloaded earlier contains such a predictor, namely, a Multi-Layer Perceptron, which is a simple example of a neural network. It has $m = 7510$ parameters, so the variable $\mathbf{z}$ is a vector with that many entries.

The code also retrieves $N = 1797$ images of handwritten digits from the MNIST database (described in class) and the corresponding labels. The images are tiny, just $8\times 8$ black-and-white pixels, for a total of $d = 64$ numbers per image. These are stored in the $N\times d$ array x_digits imported below. The corresponding labels (a label is a digit between 0 and 9 inclusive) are stored in the $N$-dimensional vector y_digits, also imported below.

You need not know what an MLP is or how it works. All you care about is that before you use it you need to reset it as shown in the code cell below. The reset_network function returns a brand new network and a starting point $\mathbf{z}_0$ for SGD.

Important: If you run two separate experiments, you need to call reset_network before each experiment, or else the network will remember what it learned in the previous experiments.

In [3]:
from mlp import x_digits, y_digits, reset_network, risk_and_gradient, plot_risk_curves

network, z0 = reset_network()

If you run the commands above and get error messages, chances are that you are using an old version of scikit-learn. Make sure you update it with

conda update -n compsci371 -c conda-forge scikit-learn

and then restart jupyter from the compsci371 environment.

The following command should print version number 1.1.1 or later.

In [4]:
import sklearn

print(sklearn.__version__)
1.1.1

The risk function $f(\mathbf{z})$ has a few twists, so you cannot just call it with argument $\mathbf{z}$ and expect it to return the value of the risk.

  • First, SGD requires you to be able to run $f$ on a subset of the data, so you also need to pass the data to $f$.
  • Second, $f$ needs to know the structure of the MLP, so you need to pass that as well.
  • Third, SGD also needs to know the gradient of $f$, and it is convenient to have a single function that returns both the value $f(\mathbf{z})$ of the risk and the gradient $\nabla f(\mathbf{z})$ at $\mathbf{z}$.

Thus, the function $f$, which is called risk_and_gradient in the code, is used as follows:

risk, g = risk_and_gradient(z, x_batch, y_batch, network)

The first argument is initially z0 (returned by reset_network above). The arguments x_batch and y_batch are just x_digits and y_digits if you want to compute the risk and its gradient on the entire training set. Otherwise, these arguments are corresponding slices of x_digits and y_digits. For instance, if you want to compute risk and gradient on the first ten training samples you would set

x_batch, y_batch = x_digits[:10], y_digits[:10]

The last argument network is the MLP, which was returned by reset_network above.

The returned values should be self-explanatory. The risk is a positive floating-point number, which with enough training can be brought very close to zero, and the gradient g is a numpy vector with $m = 7510$ entries, one for each parameter of the MLP.
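
For instance, a single call on the first ten training samples might look like this (a minimal sketch that uses only the names imported above):

x_batch, y_batch = x_digits[:10], y_digits[:10]
risk, g = risk_and_gradient(z0, x_batch, y_batch, network)
print('risk {:.3f}, gradient shape {}'.format(risk, g.shape))  # g.shape is (7510,)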

This part of the assignment is about SGD, not neural networks, so you can regard risk_and_gradient as a black-box function $f$ that you want to minimize. However, if you want to know how well the MLP is doing on the training set after some amount of training, you can find its accuracy, that is, the fraction of training examples it gets right, with the command

print('training accuracy: {}'.format(network.score(x_digits, y_digits)))

If the number printed is, say, 0.983, then the MLP gets 98.3 percent of all $N = 1797$ training samples right. A lower value of the risk corresponds to a higher accuracy, but the relation between the two is not straightforward.

Problem 2.1¶

The gradient computed over a batch is an approximation of the true gradient, which is the gradient computed over the entire training set. Smaller batches obviously lead to less accurate approximations. To see this, write a function with header

def direction_spread(z, x, y, mlp, batch_size):

that computes the average spread in the orientations of the gradients computed over all the full batches in the training set x, y.

Full Batches¶

A "full" batch is a batch that has exactly batch_size samples. Suppose for instance that batch_size is set to 100. Since the training set has $N=1797$ samples, we can form 17 full batches (since len(y_digits) // batch_size = 17). The full batches are consecutive groups of batch_size samples from x_digits, y_digits. Thus, full batches are formed from samples with indices 0:batch_size, batch_size:(2 * batch_size), and so forth. In the example, the last 97 samples in the training set are ignored.

Direction Spread¶

If the true gradient over the training set is $g_t$, then we can compute the angle in radians between $g_t$ and the gradient $g$ from a given batch as

$$ \omega = \arccos\left( \frac{g_t}{\|g_t\|} \cdot \frac{g}{\|g\|} \right) \;. $$

The direction spread is the mean of these angles over all full batches.
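
For instance, assuming g_t and g are numpy vectors holding the true and batch gradients, the angle for one batch could be computed along these lines (np.clip guards against round-off pushing the cosine slightly outside $[-1, 1]$):

cos_omega = np.dot(g_t, g) / (np.linalg.norm(g_t) * np.linalg.norm(g))
omega = np.arccos(np.clip(cos_omega, -1., 1.))  # in radians
omega_degrees = np.degrees(omega)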

Your Task¶

Show your code and a listing of the direction spreads in degrees for the batch sizes in the following sequence and then comment very briefly on the results.

In [5]:
batch_sizes = (5, 10, 20, 50, 100, 200, 500, 1000)

Programming Notes¶

  • Compute the spreads for z and mlp set to the vector of initial parameters z0 and the network network returned by reset_network(). Use x_digits and y_digits for x and y.
  • Each spread should be printed on a separate line with the batch size and the spread in degrees. Use three digits after the decimal period for spread values.

Output Format

batch size xxxx, spread xx.xxx degrees
batch size xxxx, spread xx.xxx degrees
...
...

Problem 2.2¶

In this problem you will implement a basic version of SGD. When compared to a "production quality" version, your version will have both something missing and something extra.

Missing will be fancy acceleration techniques such as variable learning rates, momentum, and so forth. Your code will only use a single, fixed value for the learning rate.

Extra will be some ways to track progress in a much more detailed way than would be feasible with a larger training problem or neural network.

Your function definition should start as follows, so we are all on the same page:

In [6]:
import numpy as np


def sgd(z, x, y, mlp,
        n_epochs=5, batch_size=200,
        epoch_alpha=1., record=False):
    rng = np.random.default_rng(0)
    n_samples = len(y)
    # Possibly miss a few data points at the end for simplicity
    n_batches = n_samples // batch_size
    alpha = epoch_alpha / n_batches
    step = 0
    if record:
        training_risks = [(step, risk_and_gradient(z, x, y, mlp)[0])]
        batch_risks = []
    else:
        training_risks, batch_risks = None, None
    for epoch in range(n_epochs):
        indices = rng.permutation(n_samples)
        b_start, b_stop = 0, batch_size
        for b in range(n_batches):
            step += 1
            batch_indices = indices[b_start:b_stop]
            x_batch, y_batch = x[batch_indices], y[batch_indices]

A few things to note:

  • A random number generator is initialized in a standard way (rng = np.random.default_rng(0)) so that all our experiments will look similar. You would not do this in production-quality code.
  • The definition of n_batches shows that we are only using full batches, as defined in the previous problem. This is OK, because the first instruction in a new epoch scrambles the data (indices = rng.permutation(n_samples)), so training samples missed in one epoch are likely to show up in another.
  • The learning rate alpha is proportional to the size of a batch (because of the definition alpha = epoch_alpha / n_batches). This makes experiments with different learning rates more comparable to each other: in this way the distance traveled over one epoch is on average the same regardless of batch size, given that all batch gradients are equal on average to the true gradient. (A toy sketch of this fixed-step update appears right after this list.)
  • A first if record block in the code above initializes two lists of risk values, as described next. Part of your job will be to add further statements under other if record clauses to update these lists appropriately.
    • training_risks will contain the value of the risk over the whole training set at the beginning (step = 0) and after each step (that is, after each batch is processed). It would be catastrophic to compute these values for larger training sets or neural networks. After all, the whole point of SGD is to avoid running computations on the whole training set all at once. On the other hand, recording these risks for our small experiments gives an unusual and interesting view of how SGD behaves, that is, how much progress it actually makes in fitting the neural network to the entire training set.
    • batch_risks will contain the value of the risk over the current batch both before and after that batch is processed. This shows how much progress SGD makes on each batch, rather than on the entire training set. The batch_risks are really what SGD "sees," since it processes one batch at a time.
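
To make the basic fixed-step update concrete before you fill in the code, here is a toy sketch on the one-variable function $\phi(z) = z^2$. The variable names are hypothetical and only illustrate the form of the update, not the solution code:

z, alpha = 3., 0.1
for step in range(100):
    g = 2 * z          # gradient of z ** 2
    z = z - alpha * g  # fixed-learning-rate descent step
print(z)               # very close to the minimizer z = 0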

Your Task¶

Your job is to complete the code of sgd. You will then reset the network (network, z0 = reset_network()), call sgd on the values that reset_network returns and on the training set x_digits, y_digits and with different combinations of values for the optional arguments. You will then call the provided function plot_risk_curves (imported earlier from mlp.py) to plot the training_risks and batch_risks.

In this first experiment, use batch_size=200, epoch_alpha=1., n_epochs=5, record=True.

For all of this to work it is important that you follow the programming notes below accurately.

Programming Notes¶

  • From a substantive point of view (that is, apart from keeping track of progress if record is True) there are only four lines missing from the code above, as described next. Write these four lines first.

    • One line computes the risk r_batch_before and gradient g on the current batch before the value of z is updated.
    • One line uses g to update the parameter vector z. This is the actual SGD step.
    • One line updates b_start and b_stop at the end of each step to move to the next batch.
    • One line at the very end of the function returns the final value of z and the two lists training_risks and batch_risks.
  • To keep track of progress, add an if record block after the line that updates z. This block records the following risk values:

    • r_batch_after is the risk on the current batch after z has been updated.
    • r_training_after is the risk on the entire training set after z has been updated.
    • The pair ((step - 1, r_batch_before), (step, r_batch_after)) is appended to batch_risks.
    • The pair (step, r_training_after) is appended to training_risks.
  • The function plot_risk_curves you imported earlier has the following header:

      def plot_risk_curves(training_risks, batch_risks, title=None, fs=16):
    
    

    When you call it, leave fs (font size) at its default value. The title should indicate the values of n_epochs, batch_size, and epoch_alpha used in the experiment.

  • In later problems you will repeat the experiment in this problem with different values for epoch_alpha and batch_size, so encapsulate all the necessary code into a function with header

      def sgd_experiment(batch_size, epoch_alpha, n_epochs, record=True):
    
    
    • Make sure you pass the arguments of sgd_experiment to sgd. The function sgd_experiment should also print out the accuracy (as explained just before Problem 2.1) and the final training risk over the entire training set. You will find this value in the appropriate place in training_risks. Display three decimal digits after the period for all real-valued quantities.
    • The function sgd_experiment should also reset the network.
    • The function sgd_experiment always calls sgd with its argument record set to its default value of True. You may want to try out sgd by itself with record=False just to make sure you don't have any bugs. Do not show this test in what you turn in.
    • The call for this problem is

        sgd_experiment(batch_size=200, epoch_alpha=1., n_epochs=5)

Problem 2.3 (Exam Style)¶

Here are three observations on the plot you made in the previous problem (assuming the plot is correct):

  • The value of the batch risk decreases consistently for each batch.
  • The value of the training risk decreases consistently for each batch.
  • The value of the batch risk typically jumps up or down between batches.

Explain each of these observations briefly in this order.

Problem 2.4 (Exam Style except for Making the Plot)¶

Show the result of running

sgd_experiment(batch_size=10, epoch_alpha=1., n_epochs=5)

This will take several seconds to run because of the many evaluations of the training risk needed for the plots.

Then comment on differences and similarities relative to the plot in the previous problem. Comment in particular on the final values of risk achieved, on the variance of the batch risks, and on the steadiness in the decrease of the training risk. Explain whenever you know how.

Problem 2.5 (Exam Style except for Making the Plots)¶

Show the results of running

sgd_experiment(batch_size=200, epoch_alpha=.01, n_epochs=5)
sgd_experiment(batch_size=200, epoch_alpha=50., n_epochs=5)

in separate notebook cells and explain briefly what you see.

Problem 2.6 (Exam Style except for Making the Plot)¶

Show the result of running

sgd_experiment(batch_size=200, epoch_alpha=1., n_epochs=80)

Is the final predictor accurate? Is it guaranteed to generalize well? Explain very briefly.

Part 3: Hyperplanes¶

Let $d$ be a positive integer. A hyperplane in $\mathbb{R}^d$ is the set of solutions to a linear equation $\tilde{\mathbf{a}}^T\mathbf{z} = \tilde{b}$ with a nonzero vector $\tilde{\mathbf{a}}\in\mathbb{R}^d$.

We use tildes because we will soon transform both $\tilde{\mathbf{a}}$ and $\tilde{b}$ to make it easier to work with them.

Examples:

  • The only hyperplanes in $\mathbb{R}$ are real numbers (that is, points on $\mathbb{R}$). This is because the only $1\times 1$ equations with a nonzero coefficient on the left are of the form $\tilde{a} x = \tilde{b}$ for $\tilde{a}\neq 0$. By varying $\tilde{a}$ and $\tilde{b}$ one can obtain all real numbers $x = \tilde{b}/\tilde{a}$.
  • Hyperplanes on the plane, that is, in $\mathbb{R}^2$, are lines.
  • Hyperplanes in $\mathbb{R}^3$ are planes.

Implicit Representation of Hyperplanes¶

The obvious representation of a hyperplane is the one given in its definition, that is, the parameters of the equation $\tilde{\mathbf{a}}^T\mathbf{z} = \tilde{b}$. In a computer program you would store the pair $(\tilde{\mathbf{a}}, \tilde{b})$. This is called the implicit representation of the hyperplane. It is most useful when one is given a point $\mathbf{z}_0$ and needs to check if it belongs to some hyperplane: Just compute $\tilde{\mathbf{a}}^T\mathbf{z}_0$ and verify if it is equal to $\tilde{b}$.
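
For example, here is a minimal membership check in numpy (np.isclose guards against floating-point round-off; the specific numbers are hypothetical):

tilde_a, tilde_b = np.array((0., 2.)), 6.
z_point = np.array((5., 3.))
print(np.isclose(np.dot(tilde_a, z_point), tilde_b))  # True, since 0 * 5 + 2 * 3 = 6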

A Nearly-Canonical Implicit Form¶

There are infinitely many choices of $\tilde{\mathbf{a}}$ and $\tilde{b}$ for the same hyperplane. To make the implicit representation a bit more canonical it is customary to choose $\tilde{\mathbf{a}}$ to have unit norm. The value of $\tilde{b}$ must then be modified accordingly. Of course, there are two unit-norm vectors parallel to each $\tilde{\mathbf{a}}$, and that's why this new representation $(\mathbf{a}, b)$ is only "nearly" canonical.

For brevity, we will use the term "canonical" instead of "near canonical" for $(\mathbf{a}, b)$.

Parametric Representation of Hyperplanes¶

In some cases it is necessary to generate points on the hyperplane. For instance, to plot part of a line you typically generate at least two points on it and connect them with a line segment. In those cases, a parametric representation is more convenient. This is in the form

$$ \mathbf{z} = \tilde{\mathbf{p}} + \tilde{Q} \boldsymbol{\alpha} $$

where $\tilde{\mathbf{p}}$ is any point on the hyperplane, the columns in the $d\times (d-1)$ matrix $\tilde{Q}$ form a basis for a linear space parallel to the hyperplane (and $\tilde{Q}$ is therefore full rank), and $\boldsymbol{\alpha}$ is a vector of $d-1$ real numbers. Then, any choice of $\boldsymbol{\alpha}$ leads to some point $\mathbf{z}$ on the hyperplane.
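
As a sketch in two dimensions, where $\tilde{Q}$ reduces to a single column $\tilde{\mathbf{q}}$ and $\boldsymbol{\alpha}$ to a scalar (the specific numbers are just an example):

tilde_p, tilde_q = np.array((-1., -1.)), np.array((2., 1.))
for alpha in (-1., 0., 1.):
    print(tilde_p + alpha * tilde_q)  # a different point on the line for each alpha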

A Nearly-Canonical Parametric Form¶

There are infinitely many choices for $\tilde{\mathbf{p}}$ and $\tilde{Q}$ for the same hyperplane. To make the parametric representation a bit more canonical it is customary to choose $\tilde{\mathbf{p}}$ as the point $\mathbf{p}$ on the hyperplane that is closest to the origin, and to replace $\tilde{Q}$ with a matrix $Q$ with orthonormal columns. Even so, there are of course many such matrices.

For brevity, we will use the term "canonical" instead of "near canonical" for $(\mathbf{p}, Q)$.

Problem 3.1 (Exam Style)¶

Let us consider hyperplanes on the plane, so that $d=2$ and the hyperplanes are lines. Answer the following questions for the line on the plane that solves the equation

$$ \tilde{\mathbf{a}}^T \mathbf{z} = \tilde{b} $$

where $\tilde{\mathbf{a}}$ is an arbitrary nonzero vector. Give very brief explanations.

  1. How do you transform $\tilde{\mathbf{a}}$ and $\tilde{b}$ to make the implicit representation of this line canonical? Specifically, give formulas for one possible choice of $\mathbf{a}$ and $b$ in terms of $\tilde{\mathbf{a}}$ and $\tilde{b}$.
  2. What is the only other canonical implicit representation of this line?
  3. Give a canonical parametric representation $\mathbf{z} = \mathbf{p} + \mathbf{q}\alpha$ for this line. Specifically, give formulas for one possible choice of $\mathbf{p}$ and $\mathbf{q}$ in terms of $\mathbf{a}$ and $b$ (this is easier than writing the formulas in terms of $\tilde{\mathbf{a}}$ and $\tilde{b}$).
  4. What is the only other canonical parametric representation of this line?

Hints¶

  • Study the class notes before working on this problem. This is always a good idea. The notes use $\mathbf{w}$ instead of $\mathbf{a}$ and $-b$ instead of $b$, but the mapping is otherwise straightforward.
  • It is best to think geometrically for the third question. Draw a line on some piece of scrap paper, identify all the relevant quantities ($\mathbf{p}$, $\mathbf{q}$, $\mathbf{a}$, $b$) in your drawing, and see how they relate to each other. Specifically, the equation $\mathbf{a}^T\mathbf{z} = b$ states that for every point $\mathbf{z}$ on the line the inner product of $\mathbf{z}$ with $\mathbf{a}$ is the same (and equal to $b$). What does that say about the orientation of $\mathbf{a}$ relative to the line? Given that, how does $\mathbf{a}$ relate to $\mathbf{q}$? And if $\mathbf{a}$ has unit norm, what does that tell you about $b$?

Problem 3.2 (Exam Style)¶

Let us now do the reverse, still for $d = 2$. Specifically, given a non-canonical parametric representation for a line in the form

$$ \mathbf{z} = \tilde{\mathbf{p}} + \tilde{\mathbf{q}}\alpha $$

answer the following questions, giving very brief explanations.

  1. How do you transform $\tilde{\mathbf{p}}$ and $\tilde{\mathbf{q}}$ to make the parametric representation of this line canonical? Specifically, give formulas for one possible choice of $\mathbf{p}$ and $\mathbf{q}$ in terms of $\tilde{\mathbf{p}}$ and $\tilde{\mathbf{q}}$.
  2. Give a canonical implicit representation $\mathbf{a}^T \mathbf{z} = b$ for this line. Specifically, give formulas for one possible choice of $\mathbf{a}$ and $b$ in terms of $\mathbf{p}$ and $\mathbf{q}$ (again, this is easier than writing the formulas in terms of $\tilde{\mathbf{p}}$ and $\tilde{\mathbf{q}}$).

Hints¶

  • Again, think geometrically. The drawing is the same as the one you used for the previous problem, just used in reverse. Then translate to algebra.
  • Finding $\mathbf{q}$ is easy, so you may want to start there.
  • To find the point $\mathbf{p}$ closest to the origin you could just recall concepts about orthogonal projections from linear algebra. Alternatively, here is a way that is based on first principles:
    • Substitute the point $\mathbf{p}$ for $\mathbf{z}$ in the parametric equation (so that $\mathbf{p} = \tilde{\mathbf{p}} + \tilde{\mathbf{q}}\alpha$) and then find the corresponding $\alpha$. The parameter $\alpha$ is a scalar while the parametric equation has two components, so this is equivalent to two equations in one unknown. One way to make the equation scalar is to multiply it from the left with a suitable row vector.
    • Use all you know about the various vectors to simplify the algebra: How does $\mathbf{p}$ relate to $\mathbf{q}$? What is the norm of $\mathbf{q}$? Answers to these questions will also indicate what row vector to multiply into the parametric equation.
  • If you have different ways to reason in mind, that's fine, too.
  • Write your result as simply as possible notation-wise, since you will then be asked to generalize that result to hyperplanes in more dimensions.

Problem 3.3¶

Write a function with header

def implicit_to_parametric_2d(tilde_a, tilde_b):

that takes the (non-canonical) implicit representation $(\tilde{\mathbf{a}}, \tilde{b})$ of a line on the plane and returns a canonical parametric representation $(\mathbf{p}, \mathbf{q})$ of the same line.

The argument tilde_a is a numpy array of shape (2,) and tilde_b is a floating point number. You may assume without checking that tilde_a is nonzero.

Show your code. Then for each line in the list implicit_lines defined below print out the inputs and outputs of implicit_to_parametric_2d. Your printout should make it clear which parameter is which, and print all values with up to three decimal digits after the period.

In [7]:
import numpy as np

implicit_lines = [
    (np.array((0., 2)), 6.),
    (np.array((3., -3.)), 9.),
    (np.array((1., 2.)), 1.)
]

Note¶

In a later problem you will be asked to write a function that checks if a parametric and an implicit representation represent the same hyperplane. This will let you go back here and verify that your outputs are correct. On the other hand, the test lines provided here are simple enough that you should be able to verify correctness by hand. You need not show the outcome of this verification here.

Problem 3.4¶

Write a function with header

def parametric_to_implicit_2d(tilde_p, tilde_q):

that takes the (non-canonical) parametric representation $(\tilde{\mathbf{p}}, \tilde{\mathbf{q}})$ of a line on the plane and returns a canonical implicit representation $(\mathbf{a}, b)$ of the same line.

The arguments tilde_p and tilde_q are numpy arrays of shape (2,). You may assume without checking that tilde_q is nonzero.

Show your code. Then for each line in the list parametric_lines defined below print out the inputs and outputs of parametric_to_implicit_2d. Your printout should make it clear which parameter is which, and print all values with up to three decimal digits after the period.

The Note in the previous problem applies here as well.

In [8]:
parametric_lines = [
    (np.array((-1., -1.)), np.array((2., 1.))),
    (np.array((0., 0.)), np.array((np.sqrt(3.), 1.))),
    (np.array((2., -3.)), np.array((1., 0.)))
]

Problem 3.5¶

Hopefully the problems above gave you a concrete sense for how these things work. We'll now use those intuitions to write conversion functions that work in any number of dimensions $d = 2, 3, \ldots$. This will be done by numerical methods, as manual algebraic calculations won't carry us far in the general case.

Incidentally, there is nothing wrong with $d = 1$. Ignoring that case here just leads to minor simplifications to the code, since in $\mathbb{R}^1$ some of the relevant spaces are empty.

Orthogonal Complements¶

The key issue in arbitrary numbers of dimensions is that there isn't just one vector orthogonal to the vector $\tilde{\mathbf{a}}\in\mathbb{R}^d$, but a whole $(d-1)$-dimensional space of them. For instance, when $d = 3$, the space of vectors orthogonal to $\tilde{\mathbf{a}}$ is spanned by two vectors, not just one.

In any number of dimensions, the space spanned by $\tilde{\mathbf{a}}$ and the space of all vectors orthogonal to $\tilde{\mathbf{a}}$ are orthogonal complements of each other in $\mathbb{R}^d$. So now instead of a single vector $\tilde{\mathbf{q}}$ there is a $d\times (d-1)$ matrix $\tilde{Q}$ whose columns span the orthogonal complement of the space spanned by $\tilde{\mathbf{a}}$.

Implicit and Parametric Representations of Hyperplanes in More Dimensions¶

The implicit representation is the same as before, just with longer vectors:

$$ \tilde{\mathbf{a}}^T \mathbf{z} = \tilde{b} $$

while the parametric representation is now

$$ \mathbf{z} = \tilde{\mathbf{p}} + \tilde{Q} \boldsymbol{\alpha} $$

where $\tilde{Q}$ is $d\times (d-1)$ and $\boldsymbol{\alpha}$ is a vector of $d-1$ numbers.

Computing Orthogonal Spaces¶

Suppose that the columns of some $d\times k$ matrix $S$ with rank $r$ span some $r$-dimensional linear subspace of $\mathbb{R}^d$ (the columns need not be orthonormal, or even linearly independent). Then the Singular Value Decomposition (SVD) of $S$ readily yields the following:

  • The rank $r$ of $S$.
  • A $d\times r$ matrix $M$ that has orthonormal columns and spans the same column space as $S$.
  • A $d\times (d-r)$ matrix $N$ that has orthonormal columns and spans the orthogonal complement of the column space of $S$.

This computation is encapsulated in the following function:

In [9]:
import numpy as np


def bases(s, tolerance=1.e-6):
    assert s.shape[0] >= s.shape[1], \
        'input matrix must have no fewer rows than columns'
    u, sigma, _ = np.linalg.svd(s)
    threshold = sigma[0] * s.shape[0] * tolerance
    rank = np.sum(sigma > threshold)
    m, n = u[:, :rank], u[:, rank:]
    return m, n, rank

Here is a usage example for a random $5\times 3$ matrix s of rank 2. Make sure you understand what is going on, especially with the output matrices below that contain all or mostly zeros.

In [10]:
dd, kk, rr = 5, 3, 2
# left and right are random and therefore full rank with probability 1
left = np.random.rand(dd, rr)
right = np.random.rand(rr, kk)
ss = np.dot(left, right)  # This is a way to make a rank-rr matrix

mm, nn, rr = bases(ss)

with np.printoptions(precision=3, suppress=True):
    print('M:', mm, sep='\n', end='\n\n')
    print('M^T M:', np.dot(mm.T, mm), sep='\n', end='\n\n')
    print('N:', nn, sep='\n', end='\n\n')
    print('N^T N:', np.dot(nn.T, nn), sep='\n', end='\n\n')
    print('M^T N:', np.dot(mm.T, nn), sep='\n', end='\n\n')
    print('rank {}'.format(rr))
M:
[[-0.516 -0.453]
 [-0.361  0.773]
 [-0.629 -0.305]
 [-0.419  0.31 ]
 [-0.178  0.097]]

M^T M:
[[ 1. -0.]
 [-0.  1.]]

N:
[[ 0.458  0.564  0.015]
 [ 0.497 -0.11  -0.117]
 [-0.166 -0.681 -0.137]
 [-0.715  0.449 -0.125]
 [-0.063 -0.061  0.975]]

N^T N:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

M^T N:
[[-0. -0.  0.]
 [-0. -0. -0.]]

rank 2

For our problem we can set $S$ (or s in the code) to $\tilde{\mathbf{a}}$ when we want to find $Q$ from $\tilde{\mathbf{a}}$, or set $S$ to $\tilde{Q}$ if we want to go the other way around, from $\tilde{Q}$ to $\mathbf{a}$.

In both cases, the answer is $N$. The matrix $M$ is $\mathbf{a}$ in the first case and $Q$ in the second. You won't need the third return value (rank), unless you want to check things for yourself.

Other than computing orthogonal spaces, the rest of the math is more or less the same in arbitrary $d \geq 2$ as it was for $d = 2$. However, you need to pay attention to array shapes and transposes to make sure everything adds and multiplies correctly. Working out a small example by hand for both $d = 2$ and $d = 3$ may help.
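
For instance, here is a sketch of the shape handling involved in going from $\tilde{\mathbf{a}}$ to an orthonormal basis for the directions parallel to the hyperplane (see also the programming notes below):

tilde_a = np.array((1., 2., 2.))     # a hypothetical example with d = 3
s = np.expand_dims(tilde_a, axis=1)  # shape (3, 1): bases expects a matrix
m, n, rank = bases(s)
print(n.shape)                       # (3, 2): columns span the orthogonal complement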

The Problem¶

Write two functions with the following headers:

def implicit_to_parametric(tilde_a, tilde_b):

def parametric_to_implicit(tilde_p, tilde_q):

These functions behave like the corresponding 2D versions, but work for dimensions 2 or greater. In terms of shapes,

  • tilde_a is a numpy vector with $d$ entries
  • tilde_b is a floating point number
  • tilde_p is a numpy vector with $d$ entries
  • tilde_q is a $d\times (d-1)$ full-rank numpy array

Show your function definitions and the result of running the following code, which checks your functions on random hyperplanes. (Uncomment the code after you have defined your functions, then run it.)

In [11]:
# from checks import check_representations

# success = check_representations(
#     implicit_to_parametric, parametric_to_implicit
# )
# print('checks {}'.format('passed' if success else 'failed'))

Programming Notes¶

  • The code expects the argument s to bases to be a matrix in all cases, that is, a two-dimensional array, while $\tilde{\mathbf{a}}$ is a vector, that is, a one-dimensional array. To fix this discrepancy you can say

      s = np.expand_dims(a, axis=1)
    
    

    Similar considerations hold when $\tilde{Q}$ happens to have a single column (that is, when $d = 2$).

  • The function bases is general in that it works even if the columns in s are linearly dependent. In that case, the function computes the rank r of s and returns matrices of appropriate sizes. In this assignment, you may assume that all $\tilde{Q}$ matrices are $d\times (d-1)$ and full rank, and therefore $r = d-1$.
  • Do not change the default value of the tolerance argument to bases.
  • You do not need to know how check_representations works. However, if you get a failure message or something does not work for you, you may want to inspect that code in the file checks.py you retrieved when you started working on this assignment.
  • If check_representations fails, do not automatically assume that it has a bug. Chances are the dimensions of some of your vectors or matrices are not right. Of course, bugs are always possible, but we checked our code on more than 2000 examples.