Understanding the Relationship Between Leverage and Mahalanobis Distance

Author

Lam Fu Yuan, Kevin

Published

December 15, 2022

In Multiple Linear Regression, it is useful to detect the presence of outliers in the sample. An indicator of the presence of outliers is the leverage (McCullagh & Nelder, 1989, p. 405). The leverage is a measure of the distance between an observation in the sample and the sample mean vector (p. 405). Given that the leverage is a measure of distance, it is perhaps unsurprising that it is related to another measure of distance known as the Mahalanobis distance (Mahalanobis, 1936). In this post, I prove the following mathematical relationship between the leverage and the Mahalanobis distance:

$$d_i^2 = (n-1)\left(h_{ii} - \frac{1}{n}\right)$$

where $d_i^2$ is the square of the Mahalanobis distance between the $i$-th observation and the sample mean vector, and $h_{ii}$ is the leverage of the $i$-th observation, for $i = 1, 2, \ldots, n$.

Notations

Before we proceed to prove the mathematical relationship between the leverage and the Mahalanobis distance, it is useful to introduce the notation that will be used in the proof.

Sample. Let $X = (x_1, x_2, \ldots, x_n)^T$ be an $n \times p$ matrix which represents a sample of $n$ observations across $p$ covariates. In $X$, the $i$-th row represents the $i$-th observation and the $j$-th column represents the $j$-th covariate, for $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, p$.

Sample Means. Let $\mu = (\mu_1, \mu_2, \ldots, \mu_p)^T = \frac{1}{n} X^T J_{n,1}$ be a column vector with $p$ elements which represents the sample mean vector, where $J_{n,1}$ is a column vector of $n$ ones.

Sample Covariances. Let $\Sigma = \frac{1}{n-1}\left(X - J_{n,1}\mu^T\right)^T\left(X - J_{n,1}\mu^T\right)$ be a $p \times p$ matrix which represents the sample covariance matrix.

Model Matrix. Let $(J_{n,1} \; X) = \left((1 \; x_1^T)^T, (1 \; x_2^T)^T, \ldots, (1 \; x_n^T)^T\right)^T$ be an $n \times (p+1)$ matrix which represents the model matrix, which includes an intercept in addition to the $p$ covariates.

Leverage. The leverage is a measure of the distance between an observation in the sample and the sample mean vector (McCullagh & Nelder, 1989, p. 405). Let $h_{ii}$ be the leverage of the $i$-th observation in the sample, for $i = 1, 2, \ldots, n$. By definition,

$$h_{ii} = (1 \; x_i^T)\left((J_{n,1} \; X)^T(J_{n,1} \; X)\right)^{-1}(1 \; x_i^T)^T$$

Equivalently, $h_{ii}$ is the $i$-th diagonal element of the hat matrix $H = (J_{n,1} \; X)\left((J_{n,1} \; X)^T(J_{n,1} \; X)\right)^{-1}(J_{n,1} \; X)^T$.

Mahalanobis Distance. The Mahalanobis distance between two vectors is a measure of the distance between them that accounts for the covariance structure of the sample. Let $d_i^2$ be the square of the Mahalanobis distance between the $i$-th observation in the sample and the sample mean vector, for $i = 1, 2, \ldots, n$. By definition,

$$d_i^2 = (x_i - \mu)^T \Sigma^{-1} (x_i - \mu)$$

Result

Now that we have introduced the notation that will be used in the proof, let us proceed to prove the mathematical relationship between the leverage and the Mahalanobis distance.

Theorem 1. The Mahalanobis distances are related to the leverages as follows:

$$d_i^2 = (n-1)\left(h_{ii} - \frac{1}{n}\right)$$

Proof:

$$\begin{aligned}
h_{ii} &= (1 \; x_i^T)\left((J_{n,1} \; X)^T(J_{n,1} \; X)\right)^{-1}(1 \; x_i^T)^T \\
&= (1 \; x_i^T)\begin{pmatrix} n & n\mu^T \\ n\mu & X^T X \end{pmatrix}^{-1}(1 \; x_i^T)^T \\
&= (1 \; x_i^T)\begin{pmatrix} \frac{1}{n} + \frac{1}{n}n\mu^T\frac{1}{n-1}\Sigma^{-1}n\mu\frac{1}{n} & -\frac{1}{n}n\mu^T\frac{1}{n-1}\Sigma^{-1} \\ -\frac{1}{n-1}\Sigma^{-1}n\mu\frac{1}{n} & \frac{1}{n-1}\Sigma^{-1} \end{pmatrix}(1 \; x_i^T)^T \\
&= (1 \; x_i^T)\begin{pmatrix} \frac{1}{n} + \frac{1}{n-1}\mu^T\Sigma^{-1}\mu & -\frac{1}{n-1}\mu^T\Sigma^{-1} \\ -\frac{1}{n-1}\Sigma^{-1}\mu & \frac{1}{n-1}\Sigma^{-1} \end{pmatrix}(1 \; x_i^T)^T \\
&= \begin{pmatrix} \frac{1}{n} + \frac{1}{n-1}\mu^T\Sigma^{-1}\mu - \frac{1}{n-1}x_i^T\Sigma^{-1}\mu & -\frac{1}{n-1}\mu^T\Sigma^{-1} + \frac{1}{n-1}x_i^T\Sigma^{-1} \end{pmatrix}(1 \; x_i^T)^T \\
&= \frac{1}{n} + \frac{1}{n-1}\left(\mu^T\Sigma^{-1}\mu - x_i^T\Sigma^{-1}\mu - \mu^T\Sigma^{-1}x_i + x_i^T\Sigma^{-1}x_i\right) \\
&= \frac{1}{n} + \frac{1}{n-1}\left[-\left(x_i^T - \mu^T\right)\Sigma^{-1}\mu + \left(x_i^T - \mu^T\right)\Sigma^{-1}x_i\right] \\
&= \frac{1}{n} + \frac{1}{n-1}\left(x_i^T - \mu^T\right)\Sigma^{-1}\left(x_i - \mu\right) \\
&= \frac{1}{n} + \frac{1}{n-1}d_i^2
\end{aligned}$$

Therefore,

$$d_i^2 = (n-1)\left(h_{ii} - \frac{1}{n}\right)$$

which was to be demonstrated.
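As a quick sanity check of Theorem 1 (not part of the proof), the sketch below simulates a small sample with NumPy and verifies the identity numerically. The sample size, dimensions, and seed are arbitrary choices for illustration.

```python
import numpy as np

# A minimal numerical spot-check of Theorem 1 on simulated data.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))

# Leverages: diagonal of the hat matrix built from the model matrix (J_{n,1} X).
M = np.column_stack([np.ones(n), X])            # n x (p + 1) model matrix
H = M @ np.linalg.inv(M.T @ M) @ M.T            # hat matrix
h = np.diag(H)                                  # leverages h_ii

# Squared Mahalanobis distances between each observation and the sample mean.
mu = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))  # inverse sample covariance
diff = X - mu
d2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)

# d_i^2 = (n - 1) * (h_ii - 1/n) should hold up to floating-point error.
assert np.allclose(d2, (n - 1) * (h - 1 / n))
```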

In the proof,

$\begin{pmatrix} n & n\mu^T \\ n\mu & X^T X \end{pmatrix}$ is inverted blockwise using the analytic inversion formula:

$$\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} A^{-1} + A^{-1}B\left(D - CA^{-1}B\right)^{-1}CA^{-1} & -A^{-1}B\left(D - CA^{-1}B\right)^{-1} \\ -\left(D - CA^{-1}B\right)^{-1}CA^{-1} & \left(D - CA^{-1}B\right)^{-1} \end{pmatrix}$$

In the analytic inversion formula,

$$D - CA^{-1}B = X^T X - n\mu\frac{1}{n}n\mu^T = X^T X - n\mu\mu^T = (n-1)\Sigma$$

This is because

$$\begin{aligned}
\Sigma &= \frac{1}{n-1}\left(X - J_{n,1}\mu^T\right)^T\left(X - J_{n,1}\mu^T\right) \\
&= \frac{1}{n-1}\left(X^T X - X^T J_{n,1}\mu^T - \mu J_{1,n} X + \mu J_{1,n} J_{n,1}\mu^T\right) \\
&= \frac{1}{n-1}\left(X^T X - n\mu\mu^T - n\mu\mu^T + n\mu\mu^T\right) \\
&= \frac{1}{n-1}\left(X^T X - n\mu\mu^T\right)
\end{aligned}$$
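The Schur complement identity $D - CA^{-1}B = (n-1)\Sigma$ can likewise be spot-checked numerically. The following sketch, again on arbitrary simulated data, builds the four blocks of $(J_{n,1} \; X)^T(J_{n,1} \; X)$ directly.

```python
import numpy as np

# Spot-check that D - C A^{-1} B = (n - 1) * Sigma for the blocks of
# (J_{n,1} X)^T (J_{n,1} X), using simulated data.
rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
mu = X.mean(axis=0)

A = np.array([[float(n)]])   # top-left block: n
B = n * mu[None, :]          # top-right block: n * mu^T
C = n * mu[:, None]          # bottom-left block: n * mu
D = X.T @ X                  # bottom-right block: X^T X

schur = D - C @ np.linalg.inv(A) @ B
assert np.allclose(schur, (n - 1) * np.cov(X, rowvar=False))
```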

References

Mahalanobis, P. C. (1936). On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India, 2(1), 49-55.

McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). Chapman & Hall.

Appendix

In the Appendix, I present some results on the equivalence between statistics obtained from samples without and with mean-centring.

Notations

Sample. Let $X^c = X - J_{n,1}\mu^T = (x_1^c, x_2^c, \ldots, x_n^c)^T$ be an $n \times p$ matrix which represents the mean-centred sample.

Sample Means. Let $\mu^c = (\mu_1^c, \mu_2^c, \ldots, \mu_p^c)^T = 0_{p,1}$ be a column vector with $p$ elements which represents the sample mean vector of the mean-centred sample, where $0_{p,1}$ is a column vector of $p$ zeros.

Sample Covariances. Let $\Sigma^c = \frac{1}{n-1}\left(X^c - J_{n,1}(\mu^c)^T\right)^T\left(X^c - J_{n,1}(\mu^c)^T\right) = \frac{1}{n-1}(X^c)^T(X^c)$ be a $p \times p$ matrix which represents the mean-centred sample covariance matrix.

Model Matrix. Let $(J_{n,1} \; X^c)$ be an $n \times (p+1)$ matrix which represents the model matrix using the mean-centred sample.

Leverage. Let $h_{ii}^c$ be the leverage of the $i$-th observation in the mean-centred sample, for $i = 1, 2, \ldots, n$. By definition,

$$h_{ii}^c = (1 \; (x_i^c)^T)\left((J_{n,1} \; X^c)^T(J_{n,1} \; X^c)\right)^{-1}(1 \; (x_i^c)^T)^T = \begin{pmatrix} 1 & x_i^T - \mu^T \end{pmatrix}\left((J_{n,1} \; X^c)^T(J_{n,1} \; X^c)\right)^{-1}\begin{pmatrix} 1 & x_i^T - \mu^T \end{pmatrix}^T$$

Mahalanobis Distance. Let $(d_i^c)^2$ be the square of the Mahalanobis distance between the $i$-th observation in the mean-centred sample and $\mu^c$, for $i = 1, 2, \ldots, n$. By definition,

$$(d_i^c)^2 = (x_i^c - \mu^c)^T(\Sigma^c)^{-1}(x_i^c - \mu^c) = (x_i - \mu)^T(\Sigma^c)^{-1}(x_i - \mu)$$

Results

Proposition 1. The sample covariance matrix is the same regardless of whether the sample has been mean-centred. In other words, $\Sigma = \Sigma^c$.

Proof:

$$\Sigma = \frac{1}{n-1}\left(X - J_{n,1}\mu^T\right)^T\left(X - J_{n,1}\mu^T\right) = \frac{1}{n-1}(X^c)^T(X^c) = \Sigma^c$$

which was to be demonstrated.

Proposition 2. The leverages are the same regardless of whether the sample has been mean-centred. In other words, $h_{ii} = h_{ii}^c$.

Proof:

$$\begin{aligned}
h_{ii}^c &= \begin{pmatrix} 1 & x_i^T - \mu^T \end{pmatrix}\left((J_{n,1} \; X^c)^T(J_{n,1} \; X^c)\right)^{-1}\begin{pmatrix} 1 & x_i^T - \mu^T \end{pmatrix}^T \\
&= \begin{pmatrix} 1 & x_i^T - \mu^T \end{pmatrix}\begin{pmatrix} n & 0_{1,p} \\ 0_{p,1} & (X^c)^T X^c \end{pmatrix}^{-1}\begin{pmatrix} 1 & x_i^T - \mu^T \end{pmatrix}^T \\
&= \begin{pmatrix} 1 & x_i^T - \mu^T \end{pmatrix}\begin{pmatrix} n & 0_{1,p} \\ 0_{p,1} & (n-1)\Sigma \end{pmatrix}^{-1}\begin{pmatrix} 1 & x_i^T - \mu^T \end{pmatrix}^T \\
&= \begin{pmatrix} 1 & x_i^T - \mu^T \end{pmatrix}\begin{pmatrix} \frac{1}{n} & 0_{1,p} \\ 0_{p,1} & \frac{1}{n-1}\Sigma^{-1} \end{pmatrix}\begin{pmatrix} 1 & x_i^T - \mu^T \end{pmatrix}^T \\
&= \begin{pmatrix} \frac{1}{n} & \frac{1}{n-1}\left(x_i^T - \mu^T\right)\Sigma^{-1} \end{pmatrix}\begin{pmatrix} 1 & x_i^T - \mu^T \end{pmatrix}^T \\
&= \frac{1}{n} + \frac{1}{n-1}\left(x_i^T - \mu^T\right)\Sigma^{-1}\left(x_i - \mu\right) \\
&= \frac{1}{n} + \frac{1}{n-1}d_i^2 \\
&= h_{ii}
\end{aligned}$$

which was to be demonstrated.

In the proof, $\begin{pmatrix} n & 0_{1,p} \\ 0_{p,1} & (n-1)\Sigma \end{pmatrix}$ is a block diagonal matrix and is therefore inverted blockwise.
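Proposition 2 can also be confirmed numerically. The sketch below, assuming NumPy and arbitrary simulated data, compares the leverages computed from the original and mean-centred samples.

```python
import numpy as np

# Spot-check of Proposition 2: leverages are unchanged by mean-centring.
rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)  # mean-centred sample

def leverages(Z):
    """Diagonal of the hat matrix for the model matrix (J_{n,1} Z)."""
    M = np.column_stack([np.ones(len(Z)), Z])
    return np.diag(M @ np.linalg.inv(M.T @ M) @ M.T)

assert np.allclose(leverages(X), leverages(Xc))
```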

Proposition 3. The Mahalanobis distances are the same regardless of whether the sample has been mean-centred. In other words, $d_i = d_i^c$.

Proof:

$$d_i^2 = (x_i - \mu)^T\Sigma^{-1}(x_i - \mu) = (x_i - \mu)^T(\Sigma^c)^{-1}(x_i - \mu) = (d_i^c)^2$$

Therefore,

$$d_i = d_i^c$$

which was to be demonstrated.