
ML-As-1

Vector Calculus Review (15 pts)

a. Show $\nabla_x(x^Tc) = c^T$

$$\nabla_x(x^Tc) = \nabla_x\left(\sum_{i=1}^{n} c_i x_i\right) = \left[\frac{\partial}{\partial x_1}\sum_{i=1}^{n} c_i x_i,\ \frac{\partial}{\partial x_2}\sum_{i=1}^{n} c_i x_i,\ \ldots,\ \frac{\partial}{\partial x_n}\sum_{i=1}^{n} c_i x_i\right] = [c_1, c_2, \ldots, c_n] = c^T$$

b. Show $\nabla_x\|x\|_2^2 = 2x^T$

$$\nabla_x\|x\|_2^2 = \nabla_x\left(\sum_{i=1}^{n} x_i^2\right) = \left[\frac{\partial}{\partial x_1}\sum_{i=1}^{n} x_i^2,\ \frac{\partial}{\partial x_2}\sum_{i=1}^{n} x_i^2,\ \ldots,\ \frac{\partial}{\partial x_n}\sum_{i=1}^{n} x_i^2\right] = [2x_1, 2x_2, \ldots, 2x_n] = 2x^T$$

c. Show $\nabla_x(Ax) = A$

$$\nabla_x(Ax) = \begin{bmatrix} \frac{\partial (Ax)_1}{\partial x_1} & \frac{\partial (Ax)_1}{\partial x_2} & \cdots & \frac{\partial (Ax)_1}{\partial x_n} \\ \frac{\partial (Ax)_2}{\partial x_1} & \frac{\partial (Ax)_2}{\partial x_2} & \cdots & \frac{\partial (Ax)_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial (Ax)_n}{\partial x_1} & \frac{\partial (Ax)_n}{\partial x_2} & \cdots & \frac{\partial (Ax)_n}{\partial x_n} \end{bmatrix}$$
where $(Ax)_i = A_{i1}x_1 + A_{i2}x_2 + \cdots + A_{in}x_n$, so
$$\nabla_x(Ax) = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{n1} & A_{n2} & \cdots & A_{nn} \end{bmatrix} = A$$

d. Show $\nabla_x(x^TAx) = x^T(A + A^T)$

$$\nabla_x(x^TAx) = \left[\frac{\partial(x^TAx)}{\partial x_1},\ \frac{\partial(x^TAx)}{\partial x_2},\ \ldots,\ \frac{\partial(x^TAx)}{\partial x_n}\right]$$
Writing $x^TAx = \sum_{i=1}^{n}\sum_{j=1}^{n} A_{ij} x_i x_j$, each partial derivative is
$$\frac{\partial(x^TAx)}{\partial x_k} = \sum_{j=1}^{n} A_{kj} x_j + \sum_{i=1}^{n} A_{ik} x_i = (Ax)_k + (A^Tx)_k$$
so, stacking the entries,
$$\nabla_x(x^TAx) = (Ax + A^Tx)^T = x^T(A + A^T)$$

e. Under what condition is the previous derivative equal to $2x^TA$?

When $A = A^T$, i.e. when $A$ is a symmetric matrix.
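
As an informal sanity check (not part of the proofs), the short sketch below compares these closed-form gradients against central finite differences on randomly chosen x, c, and A; all names and values are illustrative assumptions.

python
import numpy as np

rng = np.random.default_rng(0)
n = 4
x = rng.normal(size=n)
c = rng.normal(size=n)
A = rng.normal(size=(n, n))

def numerical_grad(f, x, eps=1e-6):
    # Central-difference approximation of the gradient of a scalar function f at x.
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# (a) gradient of x^T c is c (written as a row vector c^T above)
assert np.allclose(numerical_grad(lambda v: v @ c, x), c)
# (b) gradient of ||x||_2^2 is 2x
assert np.allclose(numerical_grad(lambda v: v @ v, x), 2 * x)
# (d) gradient of x^T A x is (A + A^T) x
assert np.allclose(numerical_grad(lambda v: v @ A @ v, x), (A + A.T) @ x)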

Bayes’ Rule (10 pts)

Assume the probability of a certain disease is 0.01. The probability of testing positive given that a person is infected with the disease is 0.95, and the probability of testing positive given that the person is not infected with the disease is 0.05.

(a) Calculate the probability of test positive.

  • $P(D) = 0.01$: the probability that a person has the disease.
  • $P(T^+\mid D) = 0.95$: the probability of testing positive given that the person has the disease.
  • $P(T^+\mid \neg D) = 0.05$: the probability of testing positive given that the person does not have the disease.
  • $P(\neg D) = 1 - P(D) = 0.99$: the probability that a person does not have the disease.
$$P(T^+) = P(T^+\mid D)\,P(D) + P(T^+\mid \neg D)\,P(\neg D) = (0.95)(0.01) + (0.05)(0.99) = 0.059$$

(b) Use Bayes’ Rule to calculate the probability of being infected with the disease given that the test is positive.

$$P(D\mid T^+) = \frac{P(T^+\mid D)\,P(D)}{P(T^+)} = \frac{(0.95)(0.01)}{0.059} \approx 0.161$$
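
For concreteness, a tiny numerical restatement of parts (a) and (b); the variable names below are illustrative only.

python
# Given quantities
p_d = 0.01        # P(D)
p_pos_d = 0.95    # P(T+ | D)
p_pos_nd = 0.05   # P(T+ | not D)

# (a) Total probability of testing positive
p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)

# (b) Bayes' rule: P(D | T+)
p_d_pos = p_pos_d * p_d / p_pos

print(p_pos)    # 0.059
print(p_d_pos)  # ~0.161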

Gradient Descent Mechanics (20 pts)

Gradient descent is the primary algorithm for finding optimal parameters for our models. Typically, we want to solve optimization problems stated as

$$\min_{\theta \in \Theta} L(f_\theta, D)$$

where $L$ is a differentiable function. In this example, we look at a simple supervised learning problem where, given a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, we want to find the optimal parameters $\theta$ that minimize some loss. We consider different models for learning the mapping from input to output, and examine the behavior of gradient descent for each model.

a

The simplest parametric model entails learning a single-parameter constant function $f_\theta(x) = \theta$, i.e. we set $\hat{y}_i = \theta$. We wish to find

$$\hat{\theta}_{\text{const}} = \arg\min_{\theta \in \mathbb{R}} L(f_\theta, D) = \arg\min_{\theta \in \mathbb{R}} \frac{1}{N}\sum_{i=1}^{N}(y_i - \theta)^2$$

i. What is the gradient of L with respect to θ?

$$L(\theta) = \frac{1}{N}\sum_{i=1}^{N}(y_i - \theta)^2$$
$$\nabla_\theta L(\theta) = \frac{1}{N}\sum_{i=1}^{N} 2(y_i - \theta)(-1) = -\frac{2}{N}\sum_{i=1}^{N}(y_i - \theta)$$

ii. What is the optimal value of θ?

Setting the gradient to zero:

$$-\frac{2}{N}\sum_{i=1}^{N}(y_i - \theta) = 0 \;\Rightarrow\; \sum_{i=1}^{N}(y_i - \theta) = 0 \;\Rightarrow\; \sum_{i=1}^{N} y_i = N\theta \;\Rightarrow\; \theta = \frac{1}{N}\sum_{i=1}^{N} y_i$$

iii. Write the gradient descent update.

$$\theta^{(t+1)} = \theta^{(t)} - \eta\,\nabla_\theta L(\theta^{(t)}) = \theta^{(t)} + \eta\,\frac{2}{N}\sum_{i=1}^{N}\left(y_i - \theta^{(t)}\right)$$

where $\eta$ is the learning rate.

iv. Stochastic Gradient Descent (SGD) is an alternative optimization algorithm where, instead of using all N samples, we use a single sample per optimization step to update the model. What is the contribution of each data point to the full gradient update?

$$\nabla_\theta L_i(\theta) = -2(y_i - \theta)$$

Thus, the gradient descent update for a single data point is:

$$\theta^{(t+1)} = \theta^{(t)} + 2\eta\left(y_i - \theta^{(t)}\right)$$

In SGD, this single-sample gradient update is applied to $\theta$ after each data point.
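
A minimal sketch of both update rules on made-up data, assuming a toy target vector y and learning rate η = 0.1; both estimates should settle near the sample mean derived in part ii.

python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.0, size=100)   # assumed toy targets
eta = 0.1                                      # assumed learning rate

# Full-batch gradient descent on the constant model
theta = 0.0
for _ in range(200):
    grad = -2.0 / len(y) * np.sum(y - theta)
    theta = theta - eta * grad

# SGD: one randomly chosen sample per update
theta_sgd = 0.0
for _ in range(200):
    i = rng.integers(len(y))
    theta_sgd = theta_sgd + eta * 2.0 * (y[i] - theta_sgd)

print(theta, theta_sgd, y.mean())  # both estimates hover near the sample mean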

b

Instead of constant functions, we now consider a single-parameter linear model $\hat{y}_i(x_i) = \theta x_i$, where we search for $\theta$ such that

$$\hat{\theta} = \arg\min_{\theta \in \mathbb{R}} \frac{1}{N}\sum_{i=1}^{N}(y_i - \theta x_i)^2$$

i. What is the gradient of L with respect to θ?

$$L(\theta) = \frac{1}{N}\sum_{i=1}^{N}(y_i - \theta x_i)^2$$
$$\nabla_\theta L(\theta) = \frac{1}{N}\sum_{i=1}^{N} 2(\theta x_i^2 - y_i x_i) = -\frac{2}{N}\sum_{i=1}^{N} x_i(y_i - \theta x_i)$$

ii. What is the optimal value of θ?

Setting the gradient to zero:

$$\frac{2}{N}\sum_{i=1}^{N}(\theta x_i^2 - y_i x_i) = 0 \;\Rightarrow\; \theta\sum_{i=1}^{N} x_i^2 = \sum_{i=1}^{N} y_i x_i \;\Rightarrow\; \theta = \frac{\sum_{i=1}^{N} y_i x_i}{\sum_{i=1}^{N} x_i^2}$$

iii. Write the gradient descent update.

$$\theta^{(t+1)} = \theta^{(t)} - \eta\,\nabla_\theta L(\theta^{(t)}) = \theta^{(t)} + \eta\,\frac{2}{N}\sum_{i=1}^{N}\left(y_i x_i - \theta^{(t)} x_i^2\right)$$

where $\eta$ is the learning rate.

iv. Do all points get the same vote in the update? Why or why not?

No, not all points get the same vote in the gradient update.

$$\nabla_\theta L(\theta) = -\frac{2}{N}\sum_{i=1}^{N} x_i(y_i - \theta x_i)$$

Each data point's residual is weighted by $x_i$: if $|x_i|$ is large, that point has a larger influence on the gradient (and thus on the update), whereas if $|x_i|$ is small, its influence is smaller.
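
A short sketch with assumed toy data for the single-parameter linear model: it runs the batch update from part iii and then inspects the per-point gradient contributions, which scale with $x_i$. The slope, inputs, and learning rate are made up for illustration.

python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 2.5                               # assumed ground-truth slope
x = rng.uniform(-3, 3, size=50)                # assumed inputs
y = theta_true * x + rng.normal(scale=0.1, size=50)

eta = 0.05                                     # assumed learning rate
theta = 0.0
for _ in range(500):
    grad = -2.0 / len(x) * np.sum(x * (y - theta * x))
    theta = theta - eta * grad

print(theta)  # should be close to theta_true

# Per-point contributions to the gradient at theta = 0 are -2 * x_i * y_i,
# so points with larger |x_i| pull the update harder.
contrib = -2.0 * x * y
print(np.abs(contrib).round(2))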

MAP Interpretation of Ridge Regression (20 pts)

Consider the Ridge Regression estimator

$$\arg\min_w \|Xw - y\|^2 + \lambda\|w\|^2$$

We know this is solved by

$$\hat{w} = (X^TX + \lambda I)^{-1}X^Ty$$

One interpretation of Ridge Regression is to find the Maximum A Posteriori (MAP) estimate of the parameters $w$, assuming that the prior on $w$ is $\mathcal{N}(0, I)$ and that the random variable $Y$ is generated using

$$Y = Xw + \sqrt{\lambda}\,N$$

Note that each entry of the vector $N$ is a zero-mean, unit-variance normal. Show that $\hat{w} = (X^TX + \lambda I)^{-1}X^Ty$ is indeed the MAP estimate for $w$ given an observation $Y = y$.


The MAP estimate maximizes the posterior distribution:

$$w_{\text{MAP}} = \arg\max_w \log P(w \mid Y)$$

From Bayes' rule:

$$P(w \mid Y) = \frac{P(Y \mid w)\,P(w)}{P(Y)} \propto P(Y \mid w)\,P(w)$$
$$w_{\text{MAP}} = \arg\max_w \log P(w \mid Y) = \arg\max_w\left(\log P(Y \mid w) + \log P(w)\right)$$
Both the likelihood and the prior are Gaussian, with density
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Substituting the likelihood and prior:

$$w_{\text{MAP}} = \arg\max_w\left(-\frac{1}{2\sigma^2}\|Xw - Y\|^2 - \frac{1}{2\nu^2}\|w\|^2\right)$$
where $\sigma^2 = \lambda$ is the noise variance and $\nu^2 = 1$ is the prior variance, so $\lambda = \frac{\sigma^2}{\nu^2}$ and
$$w_{\text{MAP}} = \arg\max_w\left(-\frac{1}{2\sigma^2}\left(\|Xw - Y\|^2 + \lambda\|w\|^2\right)\right)$$

Maximizing this expression with respect to w is equivalent to minimizing the following expression:

$$\hat{w} = \arg\min_w \|Xw - y\|^2 + \lambda\|w\|^2$$
Setting the gradient with respect to $w$ to zero gives $2X^T(Xw - y) + 2\lambda w = 0$, i.e. $(X^TX + \lambda I)w = X^Ty$, so
$$\hat{w} = (X^TX + \lambda I)^{-1}X^Ty$$
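
A small numerical sketch, with made-up X, y, and λ, comparing the closed-form Ridge/MAP solution with a direct minimization of the penalized objective via scipy.optimize.minimize; the two should coincide.

python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])            # assumed ground-truth weights
lam = 0.7                                      # assumed regularization strength
y = X @ w_true + np.sqrt(lam) * rng.normal(size=n)

# Closed-form Ridge / MAP estimate
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Direct numerical minimization of ||Xw - y||^2 + lam * ||w||^2
objective = lambda w: np.sum((X @ w - y) ** 2) + lam * np.sum(w ** 2)
w_numeric = minimize(objective, np.zeros(d)).x

print(w_closed, w_numeric)  # the two estimates should (approximately) agree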

Programming (35 pts)

Task 1

python
# You should return your result. 

import numpy as np 

def insertSecond(a, b):
    # Insert value b at index 1, i.e. as the second element of array a.
    return np.insert(a, 1, b)

assert np.array_equal(insertSecond(np.array([-5,-10,-12,-6]),5), np.array([-5, 5, -10, -12, -6]))
assert np.array_equal(insertSecond(np.array([1,2,3]),7), np.array([1, 7, 2, 3]))
assert np.array_equal(insertSecond(np.array([-5,-10,-12,-6]),8), np.array([ -5, 8, -10,-12, -6]))
assert np.array_equal(insertSecond(np.array([1,2,3]),12), np.array([1, 12, 2, 3]))

Task 2

python
import numpy as np 

def mergeArrays(a, b):
    # Concatenate both arrays and keep the unique values;
    # np.unique already returns them in sorted order.
    return np.unique(np.concatenate((a, b)))

# Test cases 
assert np.array_equal(mergeArrays(np.array([1,1,4,8,1]), np.array([2, 3])), np.array([1, 2, 3, 4, 8])) 
assert np.array_equal(mergeArrays(np.array([-5,-10,-10,-6]), np.array([-5, 8, -10, -12,-6])),np.array([-12, -10, -6, -5, 8]) )
assert np.array_equal(mergeArrays(np.array([1,1,6,8,1]), np.array([2, 3])), np.array([1, 2, 3, 6, 8]))

Task 3

python
import numpy as np
import matplotlib.pyplot as plt

# data to plot
n_groups = 5
men_means = (22, 30, 33, 30, 26)
women_means = (25, 32, 30, 35, 29)

fig, ax = plt.subplots()
index = np.arange(n_groups)
bar_width = 0.4
opacity = 0.5

rects1 = plt.bar(index, men_means, bar_width,
                 alpha=opacity, color='g', label='Men')

rects2 = plt.bar(index + bar_width, women_means, bar_width,
                 alpha=opacity, color='r', label='Women')

plt.xlabel('Person')
plt.ylabel('Scores')
plt.title('Scores by person')
plt.xticks(index + bar_width / 2, ('G1', 'G2', 'G3', 'G4', 'G5'))
plt.legend()

plt.tight_layout()
plt.show()

[Output: grouped bar chart of men's and women's scores for groups G1 to G5]

Task 4

python
import pandas as pd


def setDataFrameZeros(df):
    # Find the rows and columns that contain at least one 0
    # before any values are overwritten.
    rows = df.isin([0]).any(axis=1)
    cols = df.isin([0]).any(axis=0)
    # Set those entire rows and columns to 0 (in place) and return the frame.
    df.loc[rows, :] = 0
    df.loc[:, cols] = 0
    return df

df1 = pd.DataFrame({'c1': [1, 4, 7], 'c2': [2, 0, 8], 'c3': [3, 6, 9]})
df2 = pd.DataFrame({'c1': [1, 0, 7], 'c2': [0, 0, 0], 'c3': [3, 0, 9]})
assert (df2.equals(setDataFrameZeros(df1)))

df1 = pd.DataFrame({'c1': [0, 3, 1], 'c2': [1, 4, 3], 'c3': [2, 5, 1], 'c4': [0, 2, 5]})
df2 = pd.DataFrame({'c1': [0, 0, 0], 'c2': [0, 4, 3], 'c3': [0, 5, 1], 'c4': [0, 0, 0]})
assert (df2.equals(setDataFrameZeros(df1)))
