How to Build a Logistic Regression Model from Scratch in R

Author

Lam Fu Yuan, Kevin

Published

December 8, 2022

In this document, I demonstrate how the Newton-Raphson Method can be used to approximate the Maximum Likelihood Estimates of the parameters of a Logistic Regression model.

Learning Objectives
  1. Describe the Newton-Raphson Method to approximate the Maximum Likelihood Estimates of the parameters of a Logistic Regression model.
  2. Use the Newton-Raphson Method to approximate the Maximum Likelihood Estimates of the parameters of a Logistic Regression model in R.

1 Logistic Regression

Let $\mathbf{Y}=(Y_1, Y_2, \ldots, Y_n)^T$ be a random sample of size $n$ where $Y_i \sim B(1, \pi_i)$, $i=1,2,\ldots,n$, are $n$ independent random scalars which follow a Bernoulli distribution with parameter $\pi_i$. Let $\mathbf{y}=(y_1, y_2, \ldots, y_n)^T$ be an observed random sample in which $y_i$, $i=1,2,\ldots,n$, are $n$ realisations of the random scalars $Y_i$, $i=1,2,\ldots,n$. The probability mass function of $Y_i$ is

$$f(y_i; \pi_i) = P(y_i; \pi_i) = P(Y_i = y_i) = \pi_i^{y_i}(1-\pi_i)^{1-y_i}$$

Let $\mathbf{X}=(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)^T$ be an $n \times p$ data matrix in which each row $\mathbf{x}_i^T$, $i=1,2,\ldots,n$, represents an observation and each column represents a covariate. Let $\boldsymbol{\beta}=(\beta_1, \beta_2, \ldots, \beta_p)^T$ be the parameters of the Multiple Logistic Regression model. The model is

$$\log\left(\frac{\pi_i}{1-\pi_i}\right) = \mathbf{x}_i^T\boldsymbol{\beta} = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}$$

or

$$\pi_i = \frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}} = \frac{e^{\beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}}}{1 + e^{\beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}}}$$
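As a quick numerical illustration, $\pi_i$ can be computed directly from the linear predictor in R. This is a minimal sketch with hypothetical values for $\mathbf{x}_i$ and $\boldsymbol{\beta}$; they are not taken from the data used in Section 3.1.

x <- c(1, 3.5, 600)                # one observation: intercept, gpa, gre
b <- c(-4.9, 0.75, 0.0027)         # hypothetical parameter values
eta <- sum(x * b)                  # linear predictor x_i^T beta
pi_i <- exp(eta) / (1 + exp(eta))  # pi_i; equivalently plogis(eta)
pi_i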

2 Maximum Likelihood Estimation

In Maximum Likelihood Estimation, the problem is to find values of $\boldsymbol{\beta}=(\beta_1, \beta_2, \ldots, \beta_p)^T$ so as to

$$\max_{\boldsymbol{\beta}} L(\boldsymbol{\beta})$$

where $L(\boldsymbol{\beta})$ is the likelihood function.

2.1 Likelihood Function

The likelihood function for $\boldsymbol{\beta}$ is

$$L(\boldsymbol{\beta}) = \prod_{i=1}^{n} f(y_i; \pi_i) = \prod_{i=1}^{n} P(y_i; \pi_i) = \prod_{i=1}^{n} P(Y_i = y_i) = \prod_{i=1}^{n} \pi_i^{y_i}(1-\pi_i)^{1-y_i} = \prod_{i=1}^{n} \left(\frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}}\right)^{y_i} \left(\frac{1}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}}\right)^{1-y_i}$$

Therefore, the log-likelihood function for $\boldsymbol{\beta}$ is

$$\begin{aligned}
l(\boldsymbol{\beta}) &= \sum_{i=1}^{n} \left[ y_i \log\left(\frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}}\right) + (1-y_i)\log\left(\frac{1}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}}\right) \right] \\
&= \sum_{i=1}^{n} \left[ y_i \mathbf{x}_i^T\boldsymbol{\beta} - y_i \log\left(1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}\right) - \log\left(1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}\right) + y_i \log\left(1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}\right) \right] \\
&= \sum_{i=1}^{n} \left[ y_i \mathbf{x}_i^T\boldsymbol{\beta} - \log\left(1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}\right) \right]
\end{aligned}$$
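The log-likelihood is straightforward to evaluate numerically. Below is a minimal sketch in R, assuming a data matrix X, a response vector y, and a coefficient vector b of matching dimensions; logLikLogReg is a hypothetical helper and is not part of the fitting code in Section 3.1.

logLikLogReg <- function(X, y, b) {
  eta <- as.vector(X %*% b)         # linear predictors x_i^T beta
  sum(y * eta - log(1 + exp(eta)))  # sum over i of [y_i x_i^T beta - log(1 + e^{x_i^T beta})]
}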

2.2 First Derivative

The first derivative of the log-likelihood function with respect to the $r$-th parameter $\beta_r$, $r=1,2,\ldots,p$, is

$$\begin{aligned}
\frac{dl}{d\beta_r} &= \frac{d}{d\beta_r} \sum_{i=1}^{n} \left[ y_i \mathbf{x}_i^T\boldsymbol{\beta} - \log\left(1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}\right) \right] \\
&= \frac{d}{d\beta_r} \sum_{i=1}^{n} \left[ y_i x_{i1}\beta_1 + y_i x_{i2}\beta_2 + \cdots + y_i x_{ip}\beta_p - \log\left(1 + e^{x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ip}\beta_p}\right) \right] \\
&= \sum_{i=1}^{n} \left( y_i x_{ir} - \frac{e^{x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ip}\beta_p}}{1 + e^{x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ip}\beta_p}} x_{ir} \right) \\
&= \sum_{i=1}^{n} \left[ \left( y_i - \frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}} \right) x_{ir} \right] \\
&= \sum_{i=1}^{n} (y_i - \pi_i) x_{ir}
\end{aligned}$$

Therefore,

$$\nabla l(\boldsymbol{\beta}) = \frac{dl}{d\boldsymbol{\beta}} = \begin{bmatrix} \dfrac{dl}{d\beta_1} & \dfrac{dl}{d\beta_2} & \cdots & \dfrac{dl}{d\beta_p} \end{bmatrix}^T = \begin{bmatrix} (y_1-\pi_1)x_{11} + (y_2-\pi_2)x_{21} + \cdots + (y_n-\pi_n)x_{n1} \\ (y_1-\pi_1)x_{12} + (y_2-\pi_2)x_{22} + \cdots + (y_n-\pi_n)x_{n2} \\ \vdots \\ (y_1-\pi_1)x_{1p} + (y_2-\pi_2)x_{2p} + \cdots + (y_n-\pi_n)x_{np} \end{bmatrix} = \begin{bmatrix} x_{11} & x_{21} & \cdots & x_{n1} \\ x_{12} & x_{22} & \cdots & x_{n2} \\ \vdots & \vdots & & \vdots \\ x_{1p} & x_{2p} & \cdots & x_{np} \end{bmatrix} \begin{bmatrix} y_1-\pi_1 \\ y_2-\pi_2 \\ \vdots \\ y_n-\pi_n \end{bmatrix} = \mathbf{X}^T(\mathbf{y}-\boldsymbol{\pi})$$
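In R, the whole gradient reduces to a single matrix product. A minimal sketch, again assuming X, y, and b are defined as above; scoreLogReg is a hypothetical helper.

scoreLogReg <- function(X, y, b) {
  p <- as.vector(exp(X %*% b) / (1 + exp(X %*% b)))  # fitted probabilities pi_i
  t(X) %*% (y - p)                                   # gradient X^T (y - pi)
}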

2.3 Second Derivative

The second derivative of the log-likelihood function with respect to the $r$-th and $s$-th parameters $\beta_r$ and $\beta_s$, $r,s=1,2,\ldots,p$, is

$$\begin{aligned}
\frac{d^2 l}{d\beta_r d\beta_s} &= \frac{d}{d\beta_s} \sum_{i=1}^{n} \left[ \left( y_i - \frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}} \right) x_{ir} \right] \\
&= \frac{d}{d\beta_s} \sum_{i=1}^{n} \left[ y_i x_{ir} - \left( \frac{e^{x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ip}\beta_p}}{1 + e^{x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ip}\beta_p}} \right) x_{ir} \right] \\
&= -\sum_{i=1}^{n} \left[ \frac{e^{x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ip}\beta_p}}{1 + e^{x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ip}\beta_p}} x_{ir} x_{is} - \frac{\left(e^{x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ip}\beta_p}\right)^2}{\left(1 + e^{x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ip}\beta_p}\right)^2} x_{ir} x_{is} \right] \\
&= -\sum_{i=1}^{n} \left[ \frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}} x_{ir} x_{is} - \frac{\left(e^{\mathbf{x}_i^T\boldsymbol{\beta}}\right)^2}{\left(1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}\right)^2} x_{ir} x_{is} \right] \\
&= -\sum_{i=1}^{n} \left[ \frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}} + \left(e^{\mathbf{x}_i^T\boldsymbol{\beta}}\right)^2}{\left(1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}\right)^2} x_{ir} x_{is} - \frac{\left(e^{\mathbf{x}_i^T\boldsymbol{\beta}}\right)^2}{\left(1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}\right)^2} x_{ir} x_{is} \right] \\
&= -\sum_{i=1}^{n} \frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}}}{\left(1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}\right)^2} x_{ir} x_{is} \\
&= -\sum_{i=1}^{n} \pi_i (1-\pi_i) x_{ir} x_{is}
\end{aligned}$$

Therefore,

$$\nabla^2 l(\boldsymbol{\beta}) = \begin{bmatrix} \dfrac{d^2 l}{d\beta_1^2} & \dfrac{d^2 l}{d\beta_1 d\beta_2} & \cdots & \dfrac{d^2 l}{d\beta_1 d\beta_p} \\ \vdots & \vdots & & \vdots \\ \dfrac{d^2 l}{d\beta_p d\beta_1} & \dfrac{d^2 l}{d\beta_p d\beta_2} & \cdots & \dfrac{d^2 l}{d\beta_p^2} \end{bmatrix} = \begin{bmatrix} -\sum_{i=1}^{n} \pi_i(1-\pi_i) x_{i1} x_{i1} & \cdots & -\sum_{i=1}^{n} \pi_i(1-\pi_i) x_{i1} x_{ip} \\ \vdots & & \vdots \\ -\sum_{i=1}^{n} \pi_i(1-\pi_i) x_{ip} x_{i1} & \cdots & -\sum_{i=1}^{n} \pi_i(1-\pi_i) x_{ip} x_{ip} \end{bmatrix} = -\begin{bmatrix} x_{11} & x_{21} & \cdots & x_{n1} \\ \vdots & \vdots & & \vdots \\ x_{1p} & x_{2p} & \cdots & x_{np} \end{bmatrix} \begin{bmatrix} \pi_1(1-\pi_1) & 0 & \cdots & 0 \\ 0 & \pi_2(1-\pi_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \pi_n(1-\pi_n) \end{bmatrix} \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} = -\mathbf{X}^T \mathbf{W} \mathbf{X}$$

where $\mathbf{W} = \mathrm{diag}(\pi_i(1-\pi_i))$, $i=1,2,\ldots,n$.
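In R, the Hessian is likewise a single matrix expression. A minimal sketch, assuming X and b are defined as before; hessianLogReg is a hypothetical helper.

hessianLogReg <- function(X, b) {
  p <- as.vector(exp(X %*% b) / (1 + exp(X %*% b)))  # fitted probabilities pi_i
  W <- diag(p * (1 - p), nrow = length(p))           # W = diag(pi_i (1 - pi_i))
  -t(X) %*% W %*% X                                  # Hessian -X^T W X
}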

3 Newton-Raphson Method

Let $\theta$ be a parameter and $f$ be a function of $\theta$. Suppose that we are interested in finding the root of $f$ (i.e., the value of $\theta$ at which $f(\theta)=0$). The Newton-Raphson Method is a numerical method for approximating the root of $f$ (Thomas et al., 2005).

Algorithm 1

Newton-Raphson Method

  1. Guess a first approximation of the root of $f$.
  2. Use the first approximation to get a second approximation, the second approximation to get a third approximation, and so on, using the formula $\theta_{k+1} = \theta_k - \frac{f(\theta_k)}{f'(\theta_k)}$, where $\theta_k$ is the Newton-Raphson approximation of the root of $f$ in the $k$-th iteration; a worked example follows this list.
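As an illustration, the sketch below applies the method to a simple univariate problem: approximating the root of $f(\theta) = \theta^2 - 2$ (i.e., $\sqrt{2}$). The function, its derivative, and the starting value are illustrative only.

f  <- function(theta) theta^2 - 2        # f(theta)
fp <- function(theta) 2 * theta          # f'(theta)
theta <- 1                               # first approximation
for (k in 1:5) {
  theta <- theta - f(theta) / fp(theta)  # theta_{k+1} = theta_k - f(theta_k) / f'(theta_k)
}
theta                                    # approximately 1.414214

After only a few iterations the approximation agrees with $\sqrt{2}$ to six decimal places; the same rapid convergence shows up in the logistic regression fit below, which needs only four iterations.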

3.1 R

In this section, I demonstrate how the Newton-Raphson Method can be used to approximate the Maximum Likelihood Estimates of the parameters of a Logistic Regression model in R.

First, I load and prepare the data: the observed random sample $\mathbf{y}$ is assigned to the variable y, and the data matrix $\mathbf{X}$ (an intercept column plus the gpa and gre covariates) is assigned to the variable X.

df <- read.csv(file="https://stats.idre.ucla.edu/stat/data/binary.csv") # Graduate admission data
X <- as.matrix(data.frame(intercept=rep(x=1, times=nrow(df)), gpa=df$gpa, gre=df$gre)) # Data matrix with intercept column
y <- df$admit # Response vector (admitted = 1, not admitted = 0)

Second, I define the function newtonRaphsonLogReg which uses the Newton-Raphson Method to approximate the Maximum Likelihood Estimates of the parameters of a Logistic Regression model.
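The function applies the Newton-Raphson update to the root-finding problem $\nabla l(\boldsymbol{\beta}) = \mathbf{0}$. Using the gradient and Hessian derived above, the multivariate version of the update formula is

$$\boldsymbol{\beta}_{k+1} = \boldsymbol{\beta}_k - \left[\nabla^2 l(\boldsymbol{\beta}_k)\right]^{-1} \nabla l(\boldsymbol{\beta}_k) = \boldsymbol{\beta}_k + \left(\mathbf{X}^T \mathbf{W} \mathbf{X}\right)^{-1} \mathbf{X}^T (\mathbf{y} - \boldsymbol{\pi}) = \left(\mathbf{X}^T \mathbf{W} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{W} \mathbf{z}$$

where $\boldsymbol{\pi}$ and $\mathbf{W}$ are evaluated at $\boldsymbol{\beta}_k$ and $\mathbf{z} = \mathbf{X}\boldsymbol{\beta}_k + \mathbf{W}^{-1}(\mathbf{y} - \boldsymbol{\pi})$ is the adjusted response vector computed inside the loop. The rearranged right-hand form, commonly known as Iteratively Reweighted Least Squares, is the one implemented below.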

#' newtonRaphsonLogReg
#' @param X: A data matrix
#' @param y: A response vector
#' @param maxit: (Optional) maximum number of iterations
#' @param thr: (Optional) convergence threshold
#' @return Maximum likelihood estimates of regression parameters

newtonRaphsonLogReg <- function(X, y, maxit=10, thr=0.001) {
  b <- solve(t(X)%*%X)%*%t(X)%*%y # Initial Guess
  k <- 1 # Initial Iteration Count
  d <- max(abs(b)) # Initial Delta
  while(k<=maxit & d>thr) {
    p <- as.vector(exp(X%*%b)/(1+exp(X%*%b))) # Probabilities
    W <- diag(x=p*(1-p),nrow=length(p),ncol=length(p)) # Weight Matrix
    z <- X%*%b+solve(W)%*%(y-p) # Adjusted Response Vector
    b_new <- solve(t(X)%*%W%*%X)%*%t(X)%*%W%*%z # New Guess
    d <- max(abs(b_new-b)) # Update Delta
    b <- b_new # Update Guess
    cat(paste0("[Iteration ", k, "]: ", d, "\n"))
    k <- k+1
  }
  return(b)
}

Last, I call the function on the data.

newtonRaphsonLogReg(X=X, y=y)
[Iteration 1]: 3.56198602366629
[Iteration 2]: 0.824206953925452
[Iteration 3]: 0.0351788519326073
[Iteration 4]: 7.20576835240294e-05
                  [,1]
intercept -4.949378062
gpa        0.754686856
gre        0.002690684

The estimates match those obtained using the built-in glm function.

summary(glm(admit~gpa+gre, family="binomial", data=df))

Call:
glm(formula = admit ~ gpa + gre, family = "binomial", data = df)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.949378   1.075093  -4.604 4.15e-06 ***
gpa          0.754687   0.319586   2.361   0.0182 *  
gre          0.002691   0.001057   2.544   0.0109 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 480.34  on 397  degrees of freedom
AIC: 486.34

Number of Fisher Scoring iterations: 4

4 References

Thomas, G. B., Weir, M. D., Hass, J., & Giordano, F. R. (2005). Thomas’ calculus. Addison-Wesley.