Binary Classifier with Logistic Regression (Deep Learning Notes C1W1-3)

These are some notes for the Coursera Course on Neural Networks and Deep Learning, which is a part of the Deep Learning Specialization. This post is a summary of the course contents that I learned from Week 1 to Week 3.

These notes are far from original. All credits go to the course instructor Andrew Ng and his team.


forward propagation

logistic regression is used to predict an output/prediction \(\hat{y}\) from an input \(\mathbf{x}\)

  • \(z = \mathbf{w}^T \mathbf{x} + b\) (weighted sum of the input plus a bias term)
  • \(a = g(z)\) (activation function to introduce non-linearity)
  • \(\hat{y} = a\) (predicted output)

where \(\displaystyle \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{n_x} \end{bmatrix}, \mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_{n_x} \end{bmatrix}\) and \(z, b \in \mathbb{R}\).

estimation of loss

we want to have \(\hat{y} = y\) (i.e., prediction by model is the same as the desired output)

  • mean square error (MSE): \(L(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2\).

  • cross-entropy: \(L(\hat{y}, y) = -y \log \hat{y} - (1-y) \log(1-\hat{y})\).

    • if \(y=1\), then \(L(\hat{y}, y) = -\log \hat{y} \,\) is minimised if \(\hat{y} \rightarrow 1\)
    • if \(y=0\), then \(L(\hat{y}, y) = -\log(1-\hat{y}) \,\) is minimised if \(\hat{y} \rightarrow 0\)

for classification tasks, cross-entropy loss is more preferable:

  • strong penalty for highly confident wrong predictions

  • measure of difference between probability distributions

  • lead to convex loss function (with logistic activation)

    ensure gradient descent does not stuck in local minima

for all \(m\) samples, we can define cost function to be:

\[J(\mathbf{w}, b) = -\frac{1}{m} \sum_{i=1}^m \left[y^{(i)} \log a^{(i)} + (1-y^{(i)}) \log(1-a^{(i)}) \right]\]

our objective is to find \(\mathbf{w}^*\) and \(b^*\) in parameter space such that \(J(\mathbf{w}, b)\) is minimised

vectorisation

for a dataset with \(m\) samples, we introduce

\[\begin{align*} \text{input: } & X =\begin{bmatrix} \vdots & \vdots & &\vdots \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ \vdots & \vdots & &\vdots \end{bmatrix}_{n_x \times m}\\ \text{weights: } & W = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_{n_x} \end{bmatrix}_{n_x \times 1}\\ \text{bias: } & b = \begin{bmatrix} b & b & \cdots & b \end{bmatrix}_{1 \times m} \end{align*}\]

then we can write:

\[\begin{align*} Z &= W^T X + b = \begin{bmatrix} z^{(1)} & z^{(2)} & \cdots & z^{(m)} \end{bmatrix}_{1 \times m} \\ A &= \begin{bmatrix} g(z^{(1)}) & g(z^{(2)}) & \cdots & g(z^{(m)}) \end{bmatrix}_{1 \times m} = \begin{bmatrix} a^{(1)} & a^{(2)} & \cdots & a^{(m)} \end{bmatrix}_{1 \times m} \\ \end{align*}\]

matrix multiplication \(W^T X\) can be implemented with np.dot

to add bias \(b\), Python can broadcast a scalar \(b\) to an array that matches the shape of \(W^T X\)

activation functions

sigmoid

\[\sigma(z) = \frac{1}{1 + e^{-z}}\]
  • limits: \(\displaystyle \lim_{z\to+\infty} \sigma(z) = +1\) and \(\displaystyle \lim_{z\to-\infty} \sigma(z) = 0\)
  • derivative: \(\displaystyle \sigma'(z) = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} \;\Rightarrow \;\sigma'(z) = \sigma(z)(1-\sigma(z))\).
  • limit of derivative: \(\displaystyle \lim_{z\to\pm\infty} \sigma'(z) = 0\)

hyperbolic tangent

\[\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\]
  • limits: \(\displaystyle \lim_{z\to\pm\infty} \tanh(z) = \pm1\)
  • derivative: \(\tanh'(z) - 1 - \tanh^2(z)\)
  • limit of derivative: \(\displaystyle \lim_{z\to\pm\infty} \tanh'(z) = 0\)

ReLU (Rectified Linear Unit)

\[\text{ReLU}(z) = \text{max}(0,z) = \left\{\begin{array}{ll} z & \text{for } z \geq 0 \\ 0 & \text{for } z < 0 \end{array} \right.\]
  • derivative: \(\displaystyle \text{ReLU}(z) = \left\{\begin{array}{ll} 1 & \text{for } z \geq 0 \\ 0 & \text{for } z < 0 \end{array} \right.\)

    technically \(\text{ReLU}(z)\) is not defined at \(z=0\), but in practice \(p(z=0)\approx0\)

leaky ReLU

\[\text{leaky ReLU} = \text{max}(\beta,z) = \left\{\begin{array}{ll} z & \text{for } z \geq 0 \\ \beta z & \text{for } z < 0 \end{array} \right. \quad \text{where }\beta \ll 1\]

how to choose activation functions?

an empirical rule for choosing activation functions:

  • output layer of binary classifier \(\Rightarrow\) sigmoid function
  • hidden layers in the network \(\Rightarrow\) \(\text{ReLU}\), \(\tanh\)

gradient descent and backward propagation

iteratively update parameters so gradient descent takes \(J(\mathbf{w}, b)\) closer to its global minimum

\[\mathbf{w} := \mathbf{w} - \alpha \frac{\partial J}{\partial \mathbf{w}} \qquad b := b - \alpha \frac{\partial J}{\partial b}\]

the hyperparameter \(\alpha\) is the learning rate

backward propagation for cross-entropy

the derivatives \(\displaystyle \frac{\partial J}{\partial \mathbf{w}}\) and \(\displaystyle \frac{\partial J}{\partial b}\) are computed using the chain rule

it turns out that these derivatives can be calculated using the cached values obtained during forward propagation

for cross-entropy loss:

\[\begin{align*} \frac{\partial J}{\partial a^{(i)}} &= -\frac{1}{m} \frac{\partial}{\partial a^{(i)}} \left[ y^{(i)} \log a^{(i)} + (1-y^{(i)}) \log(1-a^{(i)}) \right] \\ \Rightarrow \frac{\partial J}{\partial a^{(i)}} & = -\frac{1}{m} \left[ \frac{y^{(i)}}{a^{(i)}} - \frac{1-y^{(i)}}{1-a^{(i)}} \right] \end{align*}\]

suppose we use sigmoid as the activation function

\[\begin{align*} \frac{\partial J}{\partial z^{(i)}} &= \frac{\partial J}{\partial a^{(i)}} \frac{\partial a^{(i)}}{\partial z^{(i)}}\\ &= -\frac{1}{m} \left[ \frac{y^{(i)}}{a^{(i)}} - \frac{1-y^{(i)}}{1-a^{(i)}} \right] \times a^{(i)}(1-a^{(i)}) \\ &= -\frac{1}{m} \left[ y^{(i)}(1-a^{(i)}) - (1-y^{(i)})a^{(i)}\right] \\ \Rightarrow \frac{\partial J}{\partial z^{(i)}} &= \frac{1}{m} \left( a^{(i)} - y^{(i)}\right) \end{align*}\]

with \(z^{(i)} = \mathbf{w}^T \mathbf{x}^{(i)} + b\), we have

\[\begin{align*} \frac{\partial J}{\partial b} &= \sum_{i=1}^m \frac{\partial J}{\partial z^{(i)}} \frac{\partial z^{(i)}}{\partial b} = \frac{1}{m} \sum_{i=1}^m \left( a^{(i)} - y^{(i)}\right) \\ \frac{\partial J}{\partial w_k} &= \sum_{i=1}^m \frac{\partial J}{\partial z^{(i)}} \frac{\partial z^{(i)}}{\partial w_k} = \frac{1}{m} \sum_{i=1}^m \left( a^{(i)} - y^{(i)}\right) x_k^{(i)} \\ \end{align*}\]

by spirit of vectorisation, we can introduce

\[\begin{align*} X & = \begin{bmatrix} \vdots & \vdots & &\vdots \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ \vdots & \vdots & &\vdots \end{bmatrix}_{n_x \times m}\\ Y & = \begin{bmatrix} y^{(1)} & y^{(2)} & \cdots & y^{(m)}\end{bmatrix}_{1 \times m}\\ A & = \begin{bmatrix} a^{(1)} & a^{(2)} & \cdots & a^{(m)} \end{bmatrix}_{1 \times m} \end{align*}\]

then we have:

\[\begin{align*} db &\equiv \frac{\partial J}{\partial b} = \frac{1}{m} \times (\text{sum of elements of } A-Y) \\ dW &\equiv \begin{bmatrix} \frac{\partial J}{\partial w_1} \\ \frac{\partial J}{\partial w_2} \\ \vdots \\ \frac{\partial J}{\partial w_{n_x}} \end{bmatrix}_{n_x \times 1} = \frac{1}{m} X(A-Y)^T \end{align*}\]

these results can be implemented with the pseudo Python codes

\[\begin{align*} db &= \frac{1}{m} * \text{np.sum}(A - Y) \\ dW &= \frac{1}{m} * \text{np.dot}(X, (A-Y)\text{.T})). \end{align*}\]

implementation of binary classifier

initialisation of weights

W = np.random.randn(n_x, 1)
b = 0
  • when initialising the weights, one can multiply \(W\) by a small number (e.g., multiply by 0.01) to keep away from the flat part of the sigmoid activation

  • bias \(b\) is usually initialised to 0

optimisation (iteration through for loop):

# forward propagation
Z = np.dot(W.T, X) + b
A = sigmoid(Z)

# backward propagation
dZ = 1/m * (A-Y)
dW = np.dot(X, dZ.T)
db = np.sum(dZ)

# gradient descent
w = w - learning_rate * dW
b = b - learning_rate * db