Binary Classifier with Logistic Regression (Deep Learning Notes C1W1-3)
These are some notes for the Coursera Course on Neural Networks and Deep Learning, which is a part of the Deep Learning Specialization. This post is a summary of the course contents that I learned from Week 1 to Week 3.
These notes are far from original. All credit goes to the course instructor Andrew Ng and his team.
forward propagation
logistic regression predicts an output \(\hat{y}\) from an input \(\mathbf{x}\)
- \(z = \mathbf{w}^T \mathbf{x} + b\) (weighted sum of the input plus a bias term)
- \(a = g(z)\) (activation function to introduce non-linearity)
- \(\hat{y} = a\) (predicted output)
where \(\displaystyle \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{n_x} \end{bmatrix}, \mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_{n_x} \end{bmatrix}\) and \(z, b \in \mathbb{R}\).
estimation of loss
we want to have \(\hat{y} = y\) (i.e., prediction by model is the same as the desired output)
- mean square error (MSE): \(L(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2\).
- cross-entropy: \(L(\hat{y}, y) = -y \log \hat{y} - (1-y) \log(1-\hat{y})\).
- if \(y=1\), then \(L(\hat{y}, y) = -\log \hat{y} \,\) is minimised if \(\hat{y} \rightarrow 1\)
- if \(y=0\), then \(L(\hat{y}, y) = -\log(1-\hat{y}) \,\) is minimised if \(\hat{y} \rightarrow 0\)
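As a quick sanity check of the two cases above, here is a minimal sketch of the cross-entropy loss in plain Python (the function name is illustrative):

```python
import math

def cross_entropy(y_hat, y):
    """Cross-entropy loss for a single binary prediction."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

# y = 1: loss shrinks as y_hat -> 1
print(cross_entropy(0.99, 1))  # small loss: confident and correct
print(cross_entropy(0.01, 1))  # large loss: confident but wrong

# y = 0: loss shrinks as y_hat -> 0
print(cross_entropy(0.01, 0))  # small loss: confident and correct
```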
for classification tasks, cross-entropy loss is preferable:
- strong penalty for highly confident wrong predictions
- measures the difference between probability distributions
- leads to a convex loss function (with logistic activation), which ensures gradient descent does not get stuck in local minima
for all \(m\) samples, we can define cost function to be:
\[J(\mathbf{w}, b) = -\frac{1}{m} \sum_{i=1}^m \left[y^{(i)} \log a^{(i)} + (1-y^{(i)}) \log(1-a^{(i)}) \right]\]our objective is to find \(\mathbf{w}^*\) and \(b^*\) in parameter space such that \(J(\mathbf{w}, b)\) is minimised
vectorisation
for a dataset with \(m\) samples, we introduce
\[\begin{align*} \text{input: } & X =\begin{bmatrix} \vdots & \vdots & &\vdots \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ \vdots & \vdots & &\vdots \end{bmatrix}_{n_x \times m}\\ \text{weights: } & W = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_{n_x} \end{bmatrix}_{n_x \times 1}\\ \text{bias: } & b = \begin{bmatrix} b & b & \cdots & b \end{bmatrix}_{1 \times m} \end{align*}\]then we can write:
\[\begin{align*} Z &= W^T X + b = \begin{bmatrix} z^{(1)} & z^{(2)} & \cdots & z^{(m)} \end{bmatrix}_{1 \times m} \\ A &= \begin{bmatrix} g(z^{(1)}) & g(z^{(2)}) & \cdots & g(z^{(m)}) \end{bmatrix}_{1 \times m} = \begin{bmatrix} a^{(1)} & a^{(2)} & \cdots & a^{(m)} \end{bmatrix}_{1 \times m} \\ \end{align*}\]matrix multiplication \(W^T X\) can be implemented with np.dot
to add bias \(b\), Python can broadcast a scalar \(b\) to an array that matches the shape of \(W^T X\)
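The vectorised forward pass and the cost \(J\) can be sketched with NumPy as follows (the toy shapes and random data are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, m = 3, 5                      # feature dimension, number of samples

X = rng.standard_normal((n_x, m))  # inputs, one column per sample
Y = rng.integers(0, 2, (1, m))     # binary labels
W = rng.standard_normal((n_x, 1)) * 0.01
b = 0.0                            # scalar bias, broadcast by NumPy

Z = np.dot(W.T, X) + b             # shape (1, m)
A = 1 / (1 + np.exp(-Z))           # sigmoid activation, shape (1, m)

# cross-entropy cost averaged over all m samples
J = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
```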
activation functions
sigmoid
\[\sigma(z) = \frac{1}{1 + e^{-z}}\]
- limits: \(\displaystyle \lim_{z\to+\infty} \sigma(z) = +1\) and \(\displaystyle \lim_{z\to-\infty} \sigma(z) = 0\)
- derivative: \(\displaystyle \sigma'(z) = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} \;\Rightarrow \;\sigma'(z) = \sigma(z)(1-\sigma(z))\).
- limit of derivative: \(\displaystyle \lim_{z\to\pm\infty} \sigma'(z) = 0\)
hyperbolic tangent
\[\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\]
- limits: \(\displaystyle \lim_{z\to\pm\infty} \tanh(z) = \pm1\)
- derivative: \(\tanh'(z) = 1 - \tanh^2(z)\)
- limit of derivative: \(\displaystyle \lim_{z\to\pm\infty} \tanh'(z) = 0\)
ReLU (Rectified Linear Unit)
\[\text{ReLU}(z) = \text{max}(0,z) = \left\{\begin{array}{ll} z & \text{for } z \geq 0 \\ 0 & \text{for } z < 0 \end{array} \right.\]
- derivative: \(\displaystyle \text{ReLU}'(z) = \left\{\begin{array}{ll} 1 & \text{for } z > 0 \\ 0 & \text{for } z < 0 \end{array} \right.\)
- technically \(\text{ReLU}'(z)\) is not defined at \(z=0\), but in practice \(p(z=0)\approx0\)
leaky ReLU
\[\text{leaky ReLU}(z) = \max(\beta z,z) = \left\{\begin{array}{ll} z & \text{for } z \geq 0 \\ \beta z & \text{for } z < 0 \end{array} \right. \quad \text{where } 0 < \beta \ll 1\]
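The activation functions above translate directly into NumPy; the sketch below also checks the sigmoid and tanh derivative identities against a finite difference (the choice \(\beta = 0.01\) is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, beta=0.01):
    return np.maximum(beta * z, z)

z = np.linspace(-3, 3, 7)
eps = 1e-6

# sigma'(z) = sigma(z) * (1 - sigma(z))
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
assert np.allclose(numeric, sigmoid(z) * (1 - sigmoid(z)))

# tanh'(z) = 1 - tanh^2(z)
numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
assert np.allclose(numeric, 1 - np.tanh(z) ** 2)
```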
how to choose activation functions?
an empirical rule for choosing activation functions:
- output layer of binary classifier \(\Rightarrow\) sigmoid function
- hidden layers in the network \(\Rightarrow\) \(\text{ReLU}\), \(\tanh\)
gradient descent and backward propagation
iteratively update the parameters so that gradient descent moves \(J(\mathbf{w}, b)\) closer to its global minimum
\[\mathbf{w} := \mathbf{w} - \alpha \frac{\partial J}{\partial \mathbf{w}} \qquad b := b - \alpha \frac{\partial J}{\partial b}\]the hyperparameter \(\alpha\) is the learning rate
backward propagation for cross-entropy
the derivatives \(\displaystyle \frac{\partial J}{\partial \mathbf{w}}\) and \(\displaystyle \frac{\partial J}{\partial b}\) are computed using the chain rule
it turns out that these derivatives can be calculated using the cached values obtained during forward propagation
for cross-entropy loss:
\[\begin{align*} \frac{\partial J}{\partial a^{(i)}} &= -\frac{1}{m} \frac{\partial}{\partial a^{(i)}} \left[ y^{(i)} \log a^{(i)} + (1-y^{(i)}) \log(1-a^{(i)}) \right] \\ \Rightarrow \frac{\partial J}{\partial a^{(i)}} & = -\frac{1}{m} \left[ \frac{y^{(i)}}{a^{(i)}} - \frac{1-y^{(i)}}{1-a^{(i)}} \right] \end{align*}\]suppose we use sigmoid as the activation function
\[\begin{align*} \frac{\partial J}{\partial z^{(i)}} &= \frac{\partial J}{\partial a^{(i)}} \frac{\partial a^{(i)}}{\partial z^{(i)}}\\ &= -\frac{1}{m} \left[ \frac{y^{(i)}}{a^{(i)}} - \frac{1-y^{(i)}}{1-a^{(i)}} \right] \times a^{(i)}(1-a^{(i)}) \\ &= -\frac{1}{m} \left[ y^{(i)}(1-a^{(i)}) - (1-y^{(i)})a^{(i)}\right] \\ \Rightarrow \frac{\partial J}{\partial z^{(i)}} &= \frac{1}{m} \left( a^{(i)} - y^{(i)}\right) \end{align*}\]with \(z^{(i)} = \mathbf{w}^T \mathbf{x}^{(i)} + b\), we have
\[\begin{align*} \frac{\partial J}{\partial b} &= \sum_{i=1}^m \frac{\partial J}{\partial z^{(i)}} \frac{\partial z^{(i)}}{\partial b} = \frac{1}{m} \sum_{i=1}^m \left( a^{(i)} - y^{(i)}\right) \\ \frac{\partial J}{\partial w_k} &= \sum_{i=1}^m \frac{\partial J}{\partial z^{(i)}} \frac{\partial z^{(i)}}{\partial w_k} = \frac{1}{m} \sum_{i=1}^m \left( a^{(i)} - y^{(i)}\right) x_k^{(i)} \\ \end{align*}\]in the spirit of vectorisation, we can introduce
\[\begin{align*} X & = \begin{bmatrix} \vdots & \vdots & &\vdots \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ \vdots & \vdots & &\vdots \end{bmatrix}_{n_x \times m}\\ Y & = \begin{bmatrix} y^{(1)} & y^{(2)} & \cdots & y^{(m)}\end{bmatrix}_{1 \times m}\\ A & = \begin{bmatrix} a^{(1)} & a^{(2)} & \cdots & a^{(m)} \end{bmatrix}_{1 \times m} \end{align*}\]then we have:
\[\begin{align*} db &\equiv \frac{\partial J}{\partial b} = \frac{1}{m} \times (\text{sum of elements of } A-Y) \\ dW &\equiv \begin{bmatrix} \frac{\partial J}{\partial w_1} \\ \frac{\partial J}{\partial w_2} \\ \vdots \\ \frac{\partial J}{\partial w_{n_x}} \end{bmatrix}_{n_x \times 1} = \frac{1}{m} X(A-Y)^T \end{align*}\]these results can be implemented with the pseudo Python codes
db = 1/m * np.sum(A - Y)
dW = 1/m * np.dot(X, (A - Y).T)
implementation of binary classifier
initialisation of weights
W = np.random.randn(n_x, 1)
b = 0
- when initialising the weights, one can multiply \(W\) by a small number (e.g., 0.01) to keep away from the flat regions of the sigmoid activation
- bias \(b\) is usually initialised to 0
optimisation (iteration through for loop):
# forward propagation
Z = np.dot(W.T, X) + b
A = sigmoid(Z)
# backward propagation
dZ = 1/m * (A-Y)
dW = np.dot(X, dZ.T)
db = np.sum(dZ)
# gradient descent
W = W - learning_rate * dW
b = b - learning_rate * db
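Putting the steps above together, a minimal end-to-end sketch might look like this (the synthetic dataset, the helper names, and the hyperparameters are illustrative choices, not from the course):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, Y, learning_rate=0.1, num_iterations=1000):
    """Train a logistic-regression binary classifier with gradient descent."""
    n_x, m = X.shape
    W = np.random.randn(n_x, 1) * 0.01  # small weights, away from sigmoid's flat regions
    b = 0.0

    for _ in range(num_iterations):
        # forward propagation
        Z = np.dot(W.T, X) + b
        A = sigmoid(Z)
        # backward propagation (1/m is folded into dZ)
        dZ = 1 / m * (A - Y)
        dW = np.dot(X, dZ.T)
        db = np.sum(dZ)
        # gradient descent update
        W = W - learning_rate * dW
        b = b - learning_rate * db
    return W, b

# toy linearly separable data: label is 1 when the first feature is positive
rng = np.random.default_rng(1)
X = rng.standard_normal((2, 200))
Y = (X[0:1, :] > 0).astype(float)

W, b = train(X, Y)
predictions = (sigmoid(np.dot(W.T, X) + b) > 0.5).astype(float)
accuracy = np.mean(predictions == Y)
```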