
ML-As-1

Vector Calculus Review (15 pts)

a. Show $\nabla_x(x^Tc) = c^T$

$$\nabla_x(x^Tc) = \nabla_x\left(\sum_{i=1}^{n} c_i x_i\right) = \left[\frac{\partial}{\partial x_1}\sum_{i=1}^{n} c_i x_i,\ \frac{\partial}{\partial x_2}\sum_{i=1}^{n} c_i x_i,\ \ldots,\ \frac{\partial}{\partial x_n}\sum_{i=1}^{n} c_i x_i\right] = [c_1, c_2, \ldots, c_n] = c^T$$

b. Show $\nabla_x\|x\|_2^2 = 2x^T$

$$\nabla_x\|x\|_2^2 = \nabla_x\left(\sum_{i=1}^{n} x_i^2\right) = \left[\frac{\partial}{\partial x_1}\sum_{i=1}^{n} x_i^2,\ \frac{\partial}{\partial x_2}\sum_{i=1}^{n} x_i^2,\ \ldots,\ \frac{\partial}{\partial x_n}\sum_{i=1}^{n} x_i^2\right] = [2x_1, 2x_2, \ldots, 2x_n] = 2x^T$$

c. Show $\nabla_x(Ax) = A$

$$\nabla_x(Ax) = \begin{bmatrix} \frac{\partial (Ax)_1}{\partial x_1} & \frac{\partial (Ax)_1}{\partial x_2} & \cdots & \frac{\partial (Ax)_1}{\partial x_n} \\ \frac{\partial (Ax)_2}{\partial x_1} & \frac{\partial (Ax)_2}{\partial x_2} & \cdots & \frac{\partial (Ax)_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial (Ax)_n}{\partial x_1} & \frac{\partial (Ax)_n}{\partial x_2} & \cdots & \frac{\partial (Ax)_n}{\partial x_n} \end{bmatrix}$$
where $(Ax)_i = A_{i1}x_1 + A_{i2}x_2 + \cdots + A_{in}x_n$, so
$$\nabla_x(Ax) = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{n1} & A_{n2} & \cdots & A_{nn} \end{bmatrix} = A$$

d. Show $\nabla_x(x^TAx) = x^T(A + A^T)$

$$\nabla_x(x^TAx) = \left[\frac{\partial(x^TAx)}{\partial x_1},\ \frac{\partial(x^TAx)}{\partial x_2},\ \ldots,\ \frac{\partial(x^TAx)}{\partial x_n}\right]$$
Writing $x^TAx = \sum_{i=1}^{n}\sum_{j=1}^{n} A_{ij} x_i x_j$, each partial derivative is
$$\frac{\partial(x^TAx)}{\partial x_k} = \sum_{j=1}^{n} A_{kj} x_j + \sum_{i=1}^{n} A_{ik} x_i = (Ax)_k + (A^Tx)_k$$
so, stacking the entries,
$$\nabla_x(x^TAx) = (Ax + A^Tx)^T = x^T(A + A^T)$$

e. Under what condition is the previous derivative equal to $2x^TA$?

When $A = A^T$, i.e. when $A$ is a symmetric matrix.
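
As an informal sanity check (not part of the proofs), the short sketch below compares these closed-form gradients against central finite differences on randomly chosen x, c, and A; all names and values are illustrative assumptions.

python
import numpy as np

rng = np.random.default_rng(0)
n = 4
x = rng.normal(size=n)
c = rng.normal(size=n)
A = rng.normal(size=(n, n))

def numerical_grad(f, x, eps=1e-6):
    # Central-difference approximation of the gradient of a scalar function f at x.
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# (a) gradient of x^T c is c (written as a row vector c^T above)
assert np.allclose(numerical_grad(lambda v: v @ c, x), c)
# (b) gradient of ||x||_2^2 is 2x
assert np.allclose(numerical_grad(lambda v: v @ v, x), 2 * x)
# (d) gradient of x^T A x is (A + A^T) x
assert np.allclose(numerical_grad(lambda v: v @ A @ v, x), (A + A.T) @ x)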

Bayes’ Rule (10 pts)

Assume the probability of a certain disease is 0.01. The probability of testing positive given that a person is infected with the disease is 0.95, and the probability of testing positive given that the person is not infected with the disease is 0.05.

(a) Calculate the probability of test positive.

  • $P(D) = 0.01$: the probability that a person has the disease.
  • $P(T^+\mid D) = 0.95$: the probability of testing positive given that the person has the disease.
  • $P(T^+\mid \neg D) = 0.05$: the probability of testing positive given that the person does not have the disease.
  • $P(\neg D) = 1 - P(D) = 0.99$: the probability that a person does not have the disease.
$$P(T^+) = P(T^+\mid D)\,P(D) + P(T^+\mid \neg D)\,P(\neg D) = (0.95)(0.01) + (0.05)(0.99) = 0.059$$

(b) Use Bayes’ Rule to calculate the probability of being infected with the disease given that the test is positive.

$$P(D\mid T^+) = \frac{P(T^+\mid D)\,P(D)}{P(T^+)} = \frac{(0.95)(0.01)}{0.059} \approx 0.161$$
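
For concreteness, a tiny numerical restatement of parts (a) and (b); the variable names below are illustrative only.

python
# Given quantities
p_d = 0.01        # P(D)
p_pos_d = 0.95    # P(T+ | D)
p_pos_nd = 0.05   # P(T+ | not D)

# (a) Total probability of testing positive
p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)

# (b) Bayes' rule: P(D | T+)
p_d_pos = p_pos_d * p_d / p_pos

print(p_pos)    # 0.059
print(p_d_pos)  # ~0.161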

Gradient Descent Mechanics (20 pts)

Gradient descent is the primary algorithm for finding optimal parameters for our models. Typically, we want to solve optimization problems stated as

$$\min_{\theta \in \Theta} L(f_\theta, D)$$

where $L$ is a differentiable function. In this example, we look at a simple supervised learning problem where, given a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, we want to find the optimal parameters $\theta$ that minimize some loss. We consider different models for learning the mapping from input to output, and examine the behavior of gradient descent for each model.

a

The simplest parametric model entails learning a single-parameter constant function $f_\theta(x) = \theta$, i.e. we set $\hat{y}_i = \theta$. We wish to find

$$\hat{\theta}_{\text{const}} = \arg\min_{\theta \in \mathbb{R}} L(f_\theta, D) = \arg\min_{\theta \in \mathbb{R}} \frac{1}{N}\sum_{i=1}^{N}(y_i - \theta)^2$$

i. What is the gradient of L with respect to θ?

$$L(\theta) = \frac{1}{N}\sum_{i=1}^{N}(y_i - \theta)^2$$
$$\nabla_\theta L(\theta) = \frac{1}{N}\sum_{i=1}^{N} 2(y_i - \theta)(-1) = -\frac{2}{N}\sum_{i=1}^{N}(y_i - \theta)$$

ii. What is the optimal value of θ?

Setting the gradient to zero:

$$-\frac{2}{N}\sum_{i=1}^{N}(y_i - \theta) = 0 \;\Rightarrow\; \sum_{i=1}^{N}(y_i - \theta) = 0 \;\Rightarrow\; \sum_{i=1}^{N} y_i = N\theta \;\Rightarrow\; \theta = \frac{1}{N}\sum_{i=1}^{N} y_i$$

iii. Write the gradient descent update.

$$\theta^{(t+1)} = \theta^{(t)} - \eta\,\nabla_\theta L(\theta^{(t)}) = \theta^{(t)} + \eta\,\frac{2}{N}\sum_{i=1}^{N}\left(y_i - \theta^{(t)}\right)$$

where $\eta$ is the learning rate.

iv. Stochastic Gradient Descent (SGD) is an alternative optimization algorithm where, instead of using all N samples, we use a single sample per optimization step to update the model. What is the contribution of each data point to the full gradient update?

$$\nabla_\theta L_i(\theta) = -2(y_i - \theta)$$

Thus, the gradient descent update for a single data point is:

$$\theta^{(t+1)} = \theta^{(t)} + 2\eta\left(y_i - \theta^{(t)}\right)$$

In SGD, this single-sample gradient update is applied to $\theta$ after each data point.
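
A minimal sketch of both update rules on made-up data, assuming a toy target vector y and learning rate η = 0.1; both estimates should settle near the sample mean derived in part ii.

python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.0, size=100)   # assumed toy targets
eta = 0.1                                      # assumed learning rate

# Full-batch gradient descent on the constant model
theta = 0.0
for _ in range(200):
    grad = -2.0 / len(y) * np.sum(y - theta)
    theta = theta - eta * grad

# SGD: one randomly chosen sample per update
theta_sgd = 0.0
for _ in range(200):
    i = rng.integers(len(y))
    theta_sgd = theta_sgd + eta * 2.0 * (y[i] - theta_sgd)

print(theta, theta_sgd, y.mean())  # both estimates hover near the sample mean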

b

Instead of constant functions, we now consider a single-parameter linear model $\hat{y}_i(x_i) = \theta x_i$, where we search for $\theta$ such that

$$\hat{\theta} = \arg\min_{\theta \in \mathbb{R}} \frac{1}{N}\sum_{i=1}^{N}(y_i - \theta x_i)^2$$

i. What is the gradient of L with respect to θ?

$$L(\theta) = \frac{1}{N}\sum_{i=1}^{N}(y_i - \theta x_i)^2$$
$$\nabla_\theta L(\theta) = \frac{1}{N}\sum_{i=1}^{N} 2(\theta x_i^2 - y_i x_i) = -\frac{2}{N}\sum_{i=1}^{N} x_i(y_i - \theta x_i)$$

ii. What is the optimal value of θ?

Setting the gradient to zero:

$$\frac{2}{N}\sum_{i=1}^{N}(\theta x_i^2 - y_i x_i) = 0 \;\Rightarrow\; \theta\sum_{i=1}^{N} x_i^2 = \sum_{i=1}^{N} y_i x_i \;\Rightarrow\; \theta = \frac{\sum_{i=1}^{N} y_i x_i}{\sum_{i=1}^{N} x_i^2}$$

iii. Write the gradient descent update.

$$\theta^{(t+1)} = \theta^{(t)} - \eta\,\nabla_\theta L(\theta^{(t)}) = \theta^{(t)} + \eta\,\frac{2}{N}\sum_{i=1}^{N}\left(y_i x_i - \theta^{(t)} x_i^2\right)$$

where $\eta$ is the learning rate.

iv. Do all points get the same vote in the update? Why or why not?

No, not all points get the same vote in the gradient update.

$$\nabla_\theta L(\theta) = -\frac{2}{N}\sum_{i=1}^{N} x_i(y_i - \theta x_i)$$

Each data point's residual is weighted by $x_i$: if $|x_i|$ is large, that point has a larger influence on the gradient (and thus on the update), whereas if $|x_i|$ is small, its influence is smaller.
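
A short sketch with assumed toy data for the single-parameter linear model: it runs the batch update from part iii and then inspects the per-point gradient contributions, which scale with $x_i$. The slope, inputs, and learning rate are made up for illustration.

python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 2.5                               # assumed ground-truth slope
x = rng.uniform(-3, 3, size=50)                # assumed inputs
y = theta_true * x + rng.normal(scale=0.1, size=50)

eta = 0.05                                     # assumed learning rate
theta = 0.0
for _ in range(500):
    grad = -2.0 / len(x) * np.sum(x * (y - theta * x))
    theta = theta - eta * grad

print(theta)  # should be close to theta_true

# Per-point contributions to the gradient at theta = 0 are -2 * x_i * y_i,
# so points with larger |x_i| pull the update harder.
contrib = -2.0 * x * y
print(np.abs(contrib).round(2))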

MAP Interpretation of Ridge Regression (20 pts)

Consider the Ridge Regression estimator

$$\arg\min_w \|Xw - y\|^2 + \lambda\|w\|^2$$

We know this is solved by

$$\hat{w} = (X^TX + \lambda I)^{-1}X^Ty$$

One interpretation of Ridge Regression is to find the Maximum A Posteriori (MAP) estimate of the parameters $w$, assuming that the prior on $w$ is $\mathcal{N}(0, I)$ and that the random variable $Y$ is generated using

$$Y = Xw + \sqrt{\lambda}\,N$$

Note that each entry of the vector $N$ is a zero-mean, unit-variance normal. Show that $\hat{w} = (X^TX + \lambda I)^{-1}X^Ty$ is indeed the MAP estimate for $w$ given an observation $Y = y$.


The MAP estimate maximizes the posterior distribution:

$$w_{\text{MAP}} = \arg\max_w \log P(w \mid Y)$$

From Bayes' rule:

$$P(w \mid Y) = \frac{P(Y \mid w)\,P(w)}{P(Y)} \propto P(Y \mid w)\,P(w)$$
$$w_{\text{MAP}} = \arg\max_w \log P(w \mid Y) = \arg\max_w\left(\log P(Y \mid w) + \log P(w)\right)$$
Both the likelihood and the prior are Gaussian, with density
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Substituting the likelihood and prior:

$$w_{\text{MAP}} = \arg\max_w\left(-\frac{1}{2\sigma^2}\|Xw - Y\|^2 - \frac{1}{2\nu^2}\|w\|^2\right)$$
where $\sigma^2 = \lambda$ is the noise variance and $\nu^2 = 1$ is the prior variance, so $\lambda = \frac{\sigma^2}{\nu^2}$ and
$$w_{\text{MAP}} = \arg\max_w\left(-\frac{1}{2\sigma^2}\left(\|Xw - Y\|^2 + \lambda\|w\|^2\right)\right)$$

Maximizing this expression with respect to w is equivalent to minimizing the following expression:

$$\hat{w} = \arg\min_w \|Xw - y\|^2 + \lambda\|w\|^2$$
Setting the gradient with respect to $w$ to zero gives $2X^T(Xw - y) + 2\lambda w = 0$, i.e. $(X^TX + \lambda I)w = X^Ty$, so
$$\hat{w} = (X^TX + \lambda I)^{-1}X^Ty$$
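
A small numerical sketch, with made-up X, y, and λ, comparing the closed-form Ridge/MAP solution with a direct minimization of the penalized objective via scipy.optimize.minimize; the two should coincide.

python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])            # assumed ground-truth weights
lam = 0.7                                      # assumed regularization strength
y = X @ w_true + np.sqrt(lam) * rng.normal(size=n)

# Closed-form Ridge / MAP estimate
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Direct numerical minimization of ||Xw - y||^2 + lam * ||w||^2
objective = lambda w: np.sum((X @ w - y) ** 2) + lam * np.sum(w ** 2)
w_numeric = minimize(objective, np.zeros(d)).x

print(w_closed, w_numeric)  # the two estimates should (approximately) agree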

Programming (35 pts)

Task 1

python
# You should return your result. 

import numpy as np 

def insertSecond(a, b):
    # Insert value b at index 1, i.e. as the second element of array a.
    return np.insert(a, 1, b)

assert np.array_equal(insertSecond(np.array([-5,-10,-12,-6]),5), np.array([-5, 5, -10, -12, -6]))
assert np.array_equal(insertSecond(np.array([1,2,3]),7), np.array([1, 7, 2, 3]))
assert np.array_equal(insertSecond(np.array([-5,-10,-12,-6]),8), np.array([ -5, 8, -10,-12, -6]))
assert np.array_equal(insertSecond(np.array([1,2,3]),12), np.array([1, 12, 2, 3]))

Task 2

python
import numpy as np 

def mergeArrays(a, b):
    # Concatenate both arrays and keep the unique values;
    # np.unique already returns them in sorted order.
    return np.unique(np.concatenate((a, b)))

# Test cases 
assert np.array_equal(mergeArrays(np.array([1,1,4,8,1]), np.array([2, 3])), np.array([1, 2, 3, 4, 8])) 
assert np.array_equal(mergeArrays(np.array([-5,-10,-10,-6]), np.array([-5, 8, -10, -12,-6])),np.array([-12, -10, -6, -5, 8]) )
assert np.array_equal(mergeArrays(np.array([1,1,6,8,1]), np.array([2, 3])), np.array([1, 2, 3, 6, 8]))

Task 3

python
import numpy as np
import matplotlib.pyplot as plt

# data to plot
n_groups = 5
men_means = (22, 30, 33, 30, 26)
women_means = (25, 32, 30, 35, 29)

fig, ax = plt.subplots()
index = np.arange(n_groups)
bar_width = 0.4
opacity = 0.5

rects1 = plt.bar(index, men_means, bar_width,
                 alpha=opacity, color='g', label='Men')

rects2 = plt.bar(index + bar_width, women_means, bar_width,
                 alpha=opacity, color='r', label='Women')

plt.xlabel('Person')
plt.ylabel('Scores')
plt.title('Scores by person')
plt.xticks(index + bar_width / 2, ('G1', 'G2', 'G3', 'G4', 'G5'))
plt.legend()

plt.tight_layout()
plt.show()

[Output: grouped bar chart of men's and women's scores for groups G1 to G5]

Task 4

python
import pandas as pd


def setDataFrameZeros(df):
    # Find the rows and columns that contain at least one 0
    # before any values are overwritten.
    rows = df.isin([0]).any(axis=1)
    cols = df.isin([0]).any(axis=0)
    # Set those entire rows and columns to 0 (in place) and return the frame.
    df.loc[rows, :] = 0
    df.loc[:, cols] = 0
    return df

df1 = pd.DataFrame({'c1': [1, 4, 7], 'c2': [2, 0, 8], 'c3': [3, 6, 9]})
df2 = pd.DataFrame({'c1': [1, 0, 7], 'c2': [0, 0, 0], 'c3': [3, 0, 9]})
assert (df2.equals(setDataFrameZeros(df1)))

df1 = pd.DataFrame({'c1': [0, 3, 1], 'c2': [1, 4, 3], 'c3': [2, 5, 1], 'c4': [0, 2, 5]})
df2 = pd.DataFrame({'c1': [0, 0, 0], 'c2': [0, 4, 3], 'c3': [0, 5, 1], 'c4': [0, 0, 0]})
assert (df2.equals(setDataFrameZeros(df1)))
