When you submit your work, follow the instructions on the submission workflow page scrupulously for full credit.
Instead of handing out sample exams in preparation for the midterm and final, homework assignments include several problems that are marked Exam Style. These problems are similar in format and content to exam problems, and are graded like the other problems. Please use them as examples of what may come up in an exam.
As a further exercise in preparation for exams, you may want to think of other plausible problems with a similar style. If you collect these problems in a separate notebook as the course progresses, you will have a useful sample you can use to rehearse for the exams.
Is this assignment properly formatted in the PDF submission, and did you tell Gradescope accurately where each answer is?
This question requires no explicit answer from you, and no space is provided for an answer in the template file.
However, the rubric entry for this problem in Gradescope will be used to deduct points for poor formatting of the assignment (PDF is not readable, code or text is clipped, math is not typeset, ...) and for incorrect indications in Gradescope for the location of each answer.
Sorry we have to do this, but poor formatting and bad answer links in Gradescope make our grading much harder.
Run the code in the following two cells before you start working on this assignment. It will retrieve two Python files that are necessary to do the work.
from urllib.request import urlretrieve
from os import path as osp

def retrieve(file_name, semester='fall22', course='371', homework=4):
    if osp.exists(file_name):
        print('Using previously downloaded file {}'.format(file_name))
    else:
        fmt = 'https://www2.cs.duke.edu/courses/{}/compsci{}/homework/{}/{}'
        url = fmt.format(semester, course, homework, file_name)
        urlretrieve(url, file_name)
        print('Downloaded file {}'.format(file_name))

for file_name in ('mlp.py', 'checks.py'):
    retrieve(file_name)
Using previously downloaded file mlp.py
Using previously downloaded file checks.py
We are looking for a local minimum of the function $\mathbb{R} \rightarrow \mathbb{R}$ defined as
$$ y = f(\alpha) = \sin\left(\frac{\pi \alpha + 1}{2}\right)\;, $$
starting at $\alpha_0 = 0$.
Since the function is univariate, we can just use line search.
The bracketing triple at some point during line search happens to be
$$ (a, b, c) = (0, 2, 5) \;. $$
Some of the problems in this part ask you to calculate values of functions. You may use any calculator you want, and Python is an obvious choice. However, please do not include any code you may use to compute these values, except when you are explicitly asked to do so.
When reporting numerical values, please give four decimal digits after the period.
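For instance, here is a minimal sketch (assuming numpy is available) of how you might evaluate $f$ in Python and print a value with four decimal digits; the name f_alpha is just for illustration.

import numpy as np

def f_alpha(alpha):
    # f(alpha) = sin((pi * alpha + 1) / 2), the function defined above
    return np.sin((np.pi * alpha + 1.) / 2.)

print('f(0) = {:.4f}'.format(f_alpha(0.)))   # prints f(0) = 0.4794, the value at alpha_0 = 0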
Write Python code to plot the function $f(\alpha)$ for $\alpha\in [-2, 6]$, place red dots at the points $(a, f(a))$, $(b, f(b))$, $(c, f(c))$, and print out the values $f(a)$, $f(b)$, $f(c)$ in the format specified above.
Show your code, your plot, and your printout. There is no need to label the axes. Specify which value is which in your printout.
Output Format
Image of the plot: a line plot with red dots as described above
f(a) = x.xxxx, f(b) = x.xxxx, f(c) = x.xxxx
Prove that the triple $(a, b, c)$ given above is indeed a valid bracketing triple. Please remember that there are conditions on both $\alpha$ and $f(\alpha)$.
Give the values of the bracketing triple $(a', b', c')$ that the line search algorithm discussed in the class notes computes immediately from the triple given above (no golden-ratio search). Explain your reasoning briefly using the notation given in the class notes.
If line search is continued from the values given above, will it eventually converge to the local minimum of $f$ that is closest to the starting point $\alpha_0 = 0$? Explain briefly. Either way, give an exact expression and approximate numerical value for the point $\alpha^{\ast}$ that line search will converge to (after infinitely many iterations), and the corresponding function value $f(\alpha^{\ast})$.
Stochastic Gradient Descent (SGD) is used to minimize functions $f(\mathbf{z})$ where $\mathbf{z}$ lives in a large space and $f$ is an average of many other functions:
$$ f(\mathbf{z}) = \frac{1}{N} \sum_{n=1}^N \phi_n(\mathbf{z}) \;. $$
A prime example of this type of function is the risk function defined for a machine learning problem. In this case, $\mathbf{z}$ is a vector that contains all the parameters in the ML predictor $h(\mathbf{x}; \mathbf{z})$. As discussed in class, the function $\phi_n$ is the loss that the predictor $h$ incurs on sample number $n$ in a training set $T$ with $N$ samples.
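As a toy illustration of this structure (made-up quadratic losses, not the MLP risk; all names below are invented for the example), the gradient of the average of the $\phi_n$ is the average of their gradients, which is why a gradient computed on a batch of samples approximates the gradient of the full risk:

import numpy as np

# Toy example: phi_n(z) = ||z - x_n||^2 / 2, so the gradient of phi_n at z is z - x_n
rng = np.random.default_rng(0)
x_samples = rng.standard_normal((100, 3))                # N = 100 toy samples in R^3
z = np.zeros(3)
per_sample_gradients = z - x_samples                     # one gradient per sample
full_gradient = per_sample_gradients.mean(axis=0)        # gradient of the average risk f
batch_gradient = per_sample_gradients[:10].mean(axis=0)  # first ten samples only
print(full_gradient)
print(batch_gradient)                                    # close to, but not equal to, the full gradient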
The code in the file mlp.py you downloaded earlier contains such a predictor, namely a Multi-Layer Perceptron (MLP), which is a simple example of a neural network. It has $m = 7510$ parameters, so the variable $\mathbf{z}$ is a vector with that many entries.
The code also retrieves $N = 1797$ images of handwritten digits from the MNIST database (described in class) and the corresponding labels. The images are tiny, just $8\times 8$ black-and-white pixels, for a total of $d = 64$ numbers per image. These are stored in the $N\times d$ array x_digits imported below. The corresponding labels (a label is a digit between 0 and 9 inclusive) are stored in the $N$-dimensional vector y_digits, also imported below.
You need not know what an MLP is or how it works. All you care about is that before you use it you need to reset it as shown in the code cell below. The reset_network function returns a brand-new network and a starting point $\mathbf{z}_0$ for SGD.
Important: If you run two separate experiments, you need to call reset_network before each experiment, or else the network will remember what it learned in the previous experiments.
from mlp import x_digits, y_digits, reset_network, risk_and_gradient, plot_risk_curves
network, z0 = reset_network()
If you run the commands above and get error messages, chances are that you are using an old version of scikit-learn. Make sure you update it with
conda update -n compsci371 -c conda-forge scikit-learn
and then restart jupyter from the compsci371 environment. The following command should print version number 1.1.1 or later.
import sklearn
print(sklearn.__version__)
1.1.1
The risk function $f(\mathbf{z})$ has a few twists, so you cannot just call it with argument $\mathbf{z}$ and expect it to return the value of the risk.
Thus, the function $f$, which is called risk_and_gradient in the code, is used as follows:
risk, g = risk_and_gradient(z, x_batch, y_batch, network)
The first argument is initially z0 (returned by reset_network above). The arguments x_batch and y_batch are just x_digits and y_digits if you want to compute the risk and its gradient on the entire training set. Otherwise, these arguments are corresponding slices of x_digits and y_digits. For instance, if you want to compute risk and gradient on the first ten training samples you would set
x_batch, y_batch = x_digits[:10], y_digits[:10]
The last argument network is the MLP, which was returned by reset_network above.
The returned values should be self-explanatory. The risk is a positive floating-point number, which with enough training can be brought very close to zero, and the gradient g is a numpy vector with $m = 7510$ entries, one for each parameter of the MLP.
This part of the assignment is about SGD, not neural networks, so you can regard risk_and_gradient as a black-box function $f$ that you want to minimize. However, if you want to know how well the MLP is doing on the training set after some amount of training, you can find its accuracy, that is, the fraction of training examples it gets right, with the command
print('training accuracy: {}'.format(network.score(x_digits, y_digits)))
If the number printed is, say, 0.983, then the MLP gets 98.3 percent of all $N = 1797$ training samples right. A lower value of the risk corresponds to a higher accuracy, but the relation between the two is not straightforward.
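To tie these pieces together, here is a minimal usage sketch (assuming the imports from mlp shown earlier have been run); it is only meant to illustrate the calls described above.

network, z0 = reset_network()                    # fresh network and starting point z_0
x_batch, y_batch = x_digits[:10], y_digits[:10]  # first ten training samples
risk, g = risk_and_gradient(z0, x_batch, y_batch, network)
print('batch risk: {:.3f}, gradient has {} entries'.format(risk, g.shape[0]))
print('training accuracy: {:.3f}'.format(network.score(x_digits, y_digits)))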
The gradient computed over a batch is an approximation of the true gradient, which is the gradient computed over the entire training set. Smaller batches obviously lead to less accurate approximations. To see this, write a function with header
def direction_spread(z, x, y, mlp, batch_size):
that computes the average spread in the orientations of the gradients computed over all the full batches in the training set x, y.
A "full" batch is a batch that has exactly batch_size
samples. Suppose for instance that batch_size
is set to 100. Since the training set has $N=1797$ samples, we can form 17 full batches (since len(y_digits) // batch_size = 17
). The full batches are consecutive groups of batch_size
samples from x_digits, y_digits
. Thus, full batches are formed from samples with indices 0:batch_size
, batch_size:(2 * batch_size)
, and so forth. In the example, the last 97 samples in the training set are ignored.
If the true gradient over the training set is $g_t$, then we can compute the angle in radians between $g_t$ and the gradient $g$ from a given batch as
$$ \omega = \arccos\left( \frac{g_t}{\|g_t\|} \cdot \frac{g}{\|g\|} \right) \;. $$
The direction spread is the mean of these angles over all full batches.
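As a sanity check on the formula, here is a small sketch (with made-up vectors g_t and g) that computes the angle between two gradient directions and converts it to degrees. Your direction_spread function would average such angles over all full batches.

import numpy as np

g_t = np.array([1., 0., 0.])                     # made-up "true" gradient
g = np.array([1., 1., 0.])                       # made-up batch gradient
cosine = np.dot(g_t / np.linalg.norm(g_t), g / np.linalg.norm(g))
omega = np.arccos(np.clip(cosine, -1., 1.))      # clip guards against round-off
print('{:.3f} radians = {:.3f} degrees'.format(omega, np.degrees(omega)))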
Show your code and a listing of the direction spreads in degrees for the batch sizes in the following sequence and then comment very briefly on the results.
batch_sizes = (5, 10, 20, 50, 100, 200, 500, 1000)
Call direction_spread with z and mlp set to the vector of initial parameters z0 and the network network returned by reset_network(). Use x_digits and y_digits for x and y.
Output Format
batch size xxxx, spread xx.xxx degrees
batch size xxxx, spread xx.xxx degrees
...
...
In this problem you will implement a basic version of SGD. When compared to a "production quality" version, your version will have both something missing and something extra.
Missing will be fancy acceleration techniques such as variable learning rates, momentum, and so forth. Your code will only use a single, fixed value for the learning rate.
Extra will be some ways to track progress in a much more detailed way than would be feasible with a larger training problem or neural network.
Your function definition should start as follows, so we are all on the same page:
import numpy as np

def sgd(z, x, y, mlp,
        n_epochs=5, batch_size=200,
        epoch_alpha=1., record=False):
    rng = np.random.default_rng(0)
    n_samples = len(y)
    # Possibly miss a few data points at the end for simplicity
    n_batches = n_samples // batch_size
    alpha = epoch_alpha / n_batches
    step = 0
    if record:
        training_risks = [(step, risk_and_gradient(z, x, y, mlp)[0])]
        batch_risks = []
    else:
        training_risks, batch_risks = None, None
    for epoch in range(n_epochs):
        indices = rng.permutation(n_samples)
        b_start, b_stop = 0, batch_size
        for b in range(n_batches):
            step += 1
            batch_indices = indices[b_start:b_stop]
            x_batch, y_batch = x[batch_indices], y[batch_indices]
A few things to note:
- We seed the random number generator with rng = np.random.default_rng(0) so that all our experiments will look similar. You would not do this in production-quality code.
- The definition of n_batches shows that we are only using full batches, as defined in the previous problem. This is OK, because the first instruction in a new epoch scrambles the data (indices = rng.permutation(n_samples)), so training samples missed in one epoch are likely to show up in another.
- The learning rate alpha is proportional to the size of a batch (because of the definition alpha = epoch_alpha / n_batches). For instance, with $N = 1797$ and batch_size=200 we get n_batches = 8, so alpha = epoch_alpha / 8. This makes experiments with different learning rates more comparable to each other, since in this way the distance traveled over one epoch is on average the same regardless of batch size, given that all batch gradients are equal on average to the true gradient.
- The if record block in the code above initializes two lists of risk values, as described next. Part of your job will be to add further statements under other if record clauses to update these lists appropriately.
- training_risks will contain the value of the risk over the whole training set at the beginning (step = 0) and after each step (that is, after each batch is processed). It would be catastrophic to compute these values for larger training sets or neural networks. After all, the whole point of SGD is to avoid running computations on the whole training set all at once. On the other hand, recording these risks for our small experiments gives an unusual and interesting view of how SGD behaves, that is, how much progress it actually makes in fitting the neural network to the entire training set.
- batch_risks will contain the value of the risk over the current batch both before and after that batch is processed. This shows how much progress SGD makes on each batch, rather than on the entire training set. The batch_risks are really what SGD "sees," since it processes one batch at a time.
Your job is to complete the code of sgd. You will then reset the network (network, z0 = reset_network()), call sgd on the values that reset_network returns and on the training set x_digits, y_digits, and with different combinations of values for the optional arguments. You will then call the provided function plot_risk_curves (imported earlier from mlp.py) to plot the training_risks and batch_risks.
In this first experiment, use batch_size=200, epoch_alpha=1., n_epochs=5, record=True.
For all of this to work it is important that you follow the programming notes below accurately.
From a substantive point of view (that is, apart from keeping track of progress if record is True) there are only four lines missing from the code above, as described next. Write these four lines first.
- Compute the risk r_batch_before and the gradient g on the current batch before the value of z is updated.
- Use g to update the parameter vector z. This is the actual SGD step.
- Update b_start and b_stop at the end of each step to move to the next batch.
- Return z and the two lists training_risks and batch_risks.
To keep track of progress, add an if record block after the line that updates z. This block records the following risk values:
- r_batch_after is the risk on the current batch after z has been updated.
- r_training_after is the risk on the entire training set after z has been updated.
- The pair ((step - 1, r_batch_before), (step, r_batch_after)) is appended to batch_risks.
- The pair (step, r_training_after) is appended to training_risks.
The function plot_risk_curves you imported earlier has the following header:
def plot_risk_curves(training_risks, batch_risks, title=None, fs=16):
When you call it, leave fs (font size) at its default value. The title should indicate the values of n_epochs, batch_size, and epoch_alpha used in the experiment.
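For instance, a call for the first experiment might look as follows, assuming training_risks and batch_risks are the lists returned by your completed sgd (the exact title wording is up to you):

plot_risk_curves(training_risks, batch_risks,
                 title='n_epochs=5, batch_size=200, epoch_alpha=1.0')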
In later problems you will repeat the experiment in this problem with different values for epoch_alpha and batch_size, so encapsulate all the necessary code into a function with header
def sgd_experiment(batch_size, epoch_alpha, n_epochs, record=True):
- Pass the arguments of sgd_experiment to sgd. The function sgd_experiment should also print out the accuracy (as explained just before Problem 2.1) and the final training risk over the entire training set. You will find this value in the appropriate place in training_risks. Display three decimal digits after the period for all real-valued quantities.
- sgd_experiment should also reset the network.
- sgd_experiment always calls sgd with its argument record set to its default value of True. You may want to try out sgd by itself with record=False just to make sure you don't have any bugs. Do not show this test in what you turn in.
The call for this problem is
sgd_experiment(batch_size=200, epoch_alpha=1., n_epochs=5)
Here are three observations on the plot you made in the previous problem (assuming the plot is correct):
Explain each of these observations briefly in this order.
Show the result of running
sgd_experiment(batch_size=10, epoch_alpha=1., n_epochs=5)
This will take several seconds to run because of the many evaluations of the training risk needed for the plots.
Then comment on differences and similarities relative to the plot in the previous problem. Comment in particular on the final values of risk achieved, on the variance of the batch risks, and on the steadiness in the decrease of the training risk. Explain whenever you know how.
Show the results of running
sgd_experiment(batch_size=200, epoch_alpha=.01, n_epochs=5)
sgd_experiment(batch_size=200, epoch_alpha=50., n_epochs=5)
in separate notebook cells and explain briefly what you see.
Show the result of running
sgd_experiment(batch_size=200, epoch_alpha=1., n_epochs=80)
Is the final predictor accurate? Is it guaranteed to generalize well? Explain very briefly.
Let $d$ be a positive integer. A hyperplane in $\mathbb{R}^d$ is the solution to a linear equation $\tilde{\mathbf{a}}^T\mathbf{z} = \tilde{b}$ with nonzero vector $\tilde{\mathbf{a}}\in\mathbb{R}^d$.
We use tildes because we will soon transform both $\tilde{\mathbf{a}}$ and $\tilde{b}$ to make it easier to work with them.
Examples:
The obvious representation of a hyperplane is the one given in its definition, that is, the parameters of the equation $\tilde{\mathbf{a}}^T\mathbf{z} = \tilde{b}$. In a computer program you would store the pair $(\tilde{\mathbf{a}}, \tilde{b})$. This is called the implicit representation of the hyperplane. It is most useful when one is given a point $\mathbf{z}_0$ and needs to check if it belongs to some hyperplane: Just compute $\tilde{\mathbf{a}}^T\mathbf{z}_0$ and verify if it is equal to $\tilde{b}$.
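For example, a minimal membership check in Python (assuming numpy, with illustrative values) is a single dot product followed by a comparison up to numerical tolerance:

import numpy as np

tilde_a, tilde_b = np.array((1., 2.)), 1.        # the line x + 2y = 1
z0 = np.array((3., -1.))                         # a candidate point
print(np.isclose(np.dot(tilde_a, z0), tilde_b))  # True, since 3 + 2 * (-1) = 1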
There are infinitely many choices of $\tilde{\mathbf{a}}$ and $\tilde{b}$ for the same hyperplane. To make the implicit representation a bit more canonical it is customary to choose $\tilde{\mathbf{a}}$ to have unit norm. The value of $\tilde{b}$ must then be modified accordingly. Of course, there are two unit-norm vectors parallel to each $\tilde{\mathbf{a}}$, and that's why this new representation $(\mathbf{a}, b)$ is only "nearly" canonical.
For brevity, we will use the term "canonical" instead of "near canonical" for $(\mathbf{a}, b)$.
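A minimal sketch of this normalization (assuming numpy; the values are taken from one of the test lines below): divide both $\tilde{\mathbf{a}}$ and $\tilde{b}$ by $\|\tilde{\mathbf{a}}\|$.

import numpy as np

tilde_a, tilde_b = np.array((3., -3.)), 9.
norm = np.linalg.norm(tilde_a)
a, b = tilde_a / norm, tilde_b / norm   # a has unit norm; a^T z = b describes the same line
print(a, b)                             # a is approximately [0.707, -0.707], b is approximately 2.121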
In some cases it is necessary to generate points on the hyperplane. For instance, to plot part of a line you typically generate at least two points on it and connect them with a line segment. In those cases, a parametric representation is more convenient. This is in the form
$$ \mathbf{z} = \tilde{\mathbf{p}} + \tilde{Q} \boldsymbol{\alpha} $$
where $\tilde{\mathbf{p}}$ is any point on the hyperplane, the columns in the $d\times (d-1)$ matrix $\tilde{Q}$ form a basis for a linear space parallel to the hyperplane (and $\tilde{Q}$ is therefore full rank), and $\boldsymbol{\alpha}$ is a vector of $d-1$ real numbers. Then, any choice of $\boldsymbol{\alpha}$ leads to some point $\mathbf{z}$ on the hyperplane.
There are infinitely many choices of $\tilde{\mathbf{p}}$ and $\tilde{Q}$ for the same hyperplane. To make the parametric representation a bit more canonical it is customary to choose $\tilde{\mathbf{p}}$ as the point $\mathbf{p}$ on the hyperplane that is closest to the origin, and to replace $\tilde{Q}$ with a matrix $Q$ with orthonormal columns. Even so, there are of course many such matrices.
For brevity, we will use the term "canonical" instead of "near canonical" for $(\mathbf{p}, Q)$.
Let us consider hyperplanes on the plane, so that $d=2$ and the hyperplanes are lines. Answer the following questions for the line on the plane that solves the equation
$$ \tilde{\mathbf{a}}^T \mathbf{z} = \tilde{b} $$
where $\tilde{\mathbf{a}}$ is an arbitrary nonzero vector. Give very brief explanations.
Let us now do the reverse, still for $d = 2$. Specifically, given a non-canonical parametric representation for a line in the form
$$ \mathbf{z} = \tilde{\mathbf{p}} + \tilde{\mathbf{q}}\alpha $$
answer the following questions, giving very brief explanations.
Write a function with header
def implicit_to_parametric_2d(tilde_a, tilde_b):
that takes the (non-canonical) implicit representation $(\tilde{\mathbf{a}}, \tilde{b})$ of a line on the plane and returns a canonical parametric representation $(\mathbf{p}, \mathbf{q})$ of the same line.
The argument tilde_a is a numpy array of shape (2,) and tilde_b is a floating point number. You may assume without checking that tilde_a is nonzero.
Show your code. Then for each line in the list implicit_lines defined below print out the inputs and outputs of implicit_to_parametric_2d. Your printout should make it clear which parameter is which, and print all values with up to three decimal digits after the period.
import numpy as np
implicit_lines = [
    (np.array((0., 2)), 6.),
    (np.array((3., -3.)), 9.),
    (np.array((1., 2.)), 1.)
]
In a later problem you will be asked to write a function that checks if a parametric and an implicit representation represent the same hyperplane. This will let you go back here and verify that your outputs are correct. On the other hand, the test lines provided here are simple enough that you should be able to verify correctness by hand. You need not show the outcome of this verification here.
Write a function with header
def parametric_to_implicit_2d(tilde_p, tilde_q):
that takes the (non-canonical) parametric representation $(\tilde{\mathbf{p}}, \tilde{\mathbf{q}})$ of a line on the plane and returns a canonical implicit representation $(\mathbf{a}, b)$ of the same line.
The arguments tilde_p and tilde_q are numpy arrays of shape (2,). You may assume without checking that tilde_q is nonzero.
Show your code. Then for each line in the list parametric_lines defined below print out the inputs and outputs of parametric_to_implicit_2d. Your printout should make it clear which parameter is which, and print all values with up to three decimal digits after the period.
The Note in the previous problem applies here as well.
parametric_lines = [
    (np.array((-1., -1.)), np.array((2., 1.))),
    (np.array((0., 0.)), np.array((np.sqrt(3.), 1.))),
    (np.array((2., -3.)), np.array((1., 0.)))
]
Hopefully the problems above gave you a concrete sense for how these things work. We'll now use those intuitions to write conversion functions that work in any number of dimensions $d = 2, 3, \ldots$. This will be done by numerical methods, as manual algebraic calculations won't carry us far in the general case.
Incidentally, there is nothing wrong with $d = 1$. Ignoring that case here just leads to minor simplifications to the code, since in $\mathbb{R}^1$ some of the relevant spaces are empty.
The key issue in arbitrary numbers of dimensions is that there isn't just one vector orthogonal to the vector $\tilde{\mathbf{a}}\in\mathbb{R}^d$, but a whole $d-1$-dimensional space of them. For instance, when $d = 3$, the space of vectors orthogonal to $\tilde{\mathbf{a}}$ is spanned by two vectors, not just one.
In any number of dimensions, the space spanned by $\tilde{\mathbf{a}}$ and the space of all vectors orthogonal to $\tilde{\mathbf{a}}$ are the orthogonal complements to each other in $\mathbb{R}^d$. So now instead of a single vector $\tilde{\mathbf{q}}$ there is a $d\times (d-1)$ matrix $\tilde{Q}$ whose columns span the orthogonal complement to the space spanned by $\tilde{\mathbf{a}}$.
The implicit representation is the same as before, just with longer vectors:
$$ \tilde{\mathbf{a}}^T \mathbf{z} = \tilde{b} $$
while the parametric representation is now
$$ \mathbf{z} = \tilde{\mathbf{p}} + \tilde{Q} \boldsymbol{\alpha} $$
where $\tilde{Q}$ is $d\times (d-1)$ and $\boldsymbol{\alpha}$ is a vector of $d-1$ numbers.
Suppose that the columns of some $d\times k$ matrix $S$ with rank $r$ form a basis (orthonormal or otherwise) for some $r$-dimensional linear subspace of $\mathbb{R}^d$. Then the Singular Value Decomposition (SVD) of $S$ easily yields the following:
- a matrix $M$ with $r$ orthonormal columns that span the same subspace as the columns of $S$;
- a matrix $N$ with $d-r$ orthonormal columns that span the orthogonal complement of that subspace in $\mathbb{R}^d$;
- the rank $r$ of $S$.
This computation is encapsulated in the following function:
import numpy as np
def bases(s, tolerance=1.e-6):
    assert s.shape[0] >= s.shape[1], \
        'input matrix must have no fewer rows than columns'
    u, sigma, _ = np.linalg.svd(s)
    threshold = sigma[0] * s.shape[0] * tolerance
    rank = np.sum(sigma > threshold)
    m, n = u[:, :rank], u[:, rank:]
    return m, n, rank
Here is a usage example for a random $5\times 3$ matrix s of rank 2. Make sure you understand what is going on, especially with the output matrices below that contain all or mostly zeros.
dd, kk, rr = 5, 3, 2
# left and right are random and therefore full rank with probability 1
left = np.random.rand(dd, rr)
right = np.random.rand(rr, kk)
ss = np.dot(left, right) # This is a way to make a rank-rr matrix
mm, nn, rr = bases(ss)
with np.printoptions(precision=3, suppress=True):
    print('M:', mm, sep='\n', end='\n\n')
    print('M^T M:', np.dot(mm.T, mm), sep='\n', end='\n\n')
    print('N:', nn, sep='\n', end='\n\n')
    print('N^T N:', np.dot(nn.T, nn), sep='\n', end='\n\n')
    print('M^T N:', np.dot(mm.T, nn), sep='\n', end='\n\n')
    print('rank {}'.format(rr))
M:
[[-0.516 -0.453]
 [-0.361  0.773]
 [-0.629 -0.305]
 [-0.419  0.31 ]
 [-0.178  0.097]]

M^T M:
[[ 1. -0.]
 [-0.  1.]]

N:
[[ 0.458  0.564  0.015]
 [ 0.497 -0.11  -0.117]
 [-0.166 -0.681 -0.137]
 [-0.715  0.449 -0.125]
 [-0.063 -0.061  0.975]]

N^T N:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

M^T N:
[[-0. -0.  0.]
 [-0. -0. -0.]]

rank 2
For our problem we can set $S$ (or s in the code) to $\tilde{\mathbf{a}}$ when we want to find $Q$ from $\tilde{\mathbf{a}}$, or set $S$ to $\tilde{Q}$ if we want to go the other way around, from $\tilde{Q}$ to $\mathbf{a}$. In both cases, the answer is $N$. The matrix $M$ is $\mathbf{a}$ in the first case and $Q$ in the second. You won't need the third return value (rank), unless you want to check things for yourself.
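As a quick illustration of this recipe (shapes only; tilde_a here is a made-up vector in $\mathbb{R}^3$, and the reshaping with np.expand_dims is discussed in the programming notes further below):

import numpy as np

tilde_a = np.array((1., 2., 2.))                  # made-up vector in R^3
m, n, rank = bases(np.expand_dims(tilde_a, axis=1))
# m has one orthonormal column parallel to tilde_a; the d - 1 = 2 orthonormal
# columns of n span the directions parallel to the hyperplane.
print(m.shape, n.shape, rank)                     # (3, 1) (3, 2) 1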
Other than computing orthogonal spaces, the rest of the math is more or less the same in arbitrary $d \geq 2$ as it was for $d = 2$. However, you need to pay attention to array shapes and transposes to make sure everything adds and multiplies correctly. Working out a small example by hand for both $d = 2$ and $d = 3$ may help.
Write two functions with the following headers:
def implicit_to_parametric(tilde_a, tilde_b):
def parametric_to_implicit(tilde_p, tilde_q):
These two functions behave like the corresponding 2D versions, but work for dimensions 2 or greater. In terms of shapes,
- tilde_a is a numpy vector with $d$ entries;
- tilde_b is a floating point number;
- tilde_p is a numpy vector with $d$ entries;
- tilde_q is a $d\times (d-1)$ full-rank numpy array.
Show your function definitions and the result of running the following code, which checks your functions on random hyperplanes. (Uncomment the code after defining your functions and before running it.)
# from checks import check_representations
# success = check_representations(
# implicit_to_parametric, parametric_to_implicit
# )
# print('checks {}'.format('passed' if success else 'failed'))
- The code expects the argument s to bases to be a matrix in all cases, that is, a two-dimensional array, while $\tilde{\mathbf{a}}$ is a vector, that is, a one-dimensional array. To fix this discrepancy you can say
s = np.expand_dims(a, axis=1)
Similar considerations hold when $\tilde{Q}$ happens to have a single column (that is, when $d = 2$).
- bases is general in that it works even if the columns in s are linearly dependent. In that case, the function computes the rank r of s and returns matrices of appropriate sizes. In this assignment, you may assume that all $\tilde{Q}$ matrices are $d\times (d-1)$ and full rank, and therefore $r = d-1$.
- There should be no need to change the tolerance argument to bases.
- You need not know how check_representations works. However, if you get a failure message or something does not work for you, you may want to inspect that code in the file checks.py you retrieved when you started working on this assignment.
- If check_representations fails, do not automatically assume that it has a bug. Chances are the dimensions of some of your vectors or matrices are not right. Of course, bugs are always possible, but we checked our code on more than 2000 examples.