# Simulating multivariate outliers with known Mahalanobis’ distance.

###### Disclaimer: This entry is an extension of Rick Wicklin’s *How to simulate multivariate outliers*. The “main engine” behind this method is 100% due to him.

One of the last things I taught in my Intro to Monte Carlo Simulations class was how to simulate outliers. The “standard” would be to invoke something like a contaminated normal distribution, which is really just a very fancy name for a 2-component (univariate) Gaussian mixture where both components have the same mean but different variances, creating heavy tails and making the distribution more outlier-prone. You can generalize this idea in several directions (different means and different variances for each component, multivariate Gaussian mixtures, etc.). Nevertheless, it doesn’t really provide a way to designate specific points as outliers. As a mixture, the population parameter that governs the outlier creation is the mixing proportion. So… yeah, not super direct.
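As a quick illustration, here is a minimal sketch of a contaminated normal in R. The mixing proportion (0.1) and the inflated standard deviation (3) are my own arbitrary choices for the example, not values from any particular reference:

```r
# Contaminated normal: with probability 0.9 draw from N(0, 1),
# with probability 0.1 draw from N(0, 3^2) -- same mean, bigger variance.
set.seed(1)
n <- 1e5
contaminated <- rbinom(n, 1, 0.1) == 1               # flags the "outlier" component
x <- rnorm(n, mean = 0, sd = ifelse(contaminated, 3, 1))

# Heavier tails than a standard normal: P(|X| > 3) is much larger
# than the standard-normal value of 2 * pnorm(-3) ~ 0.0027.
mean(abs(x) > 3)
```

Note that nothing here marks *which* points are outliers by design; extreme values just happen more often, which is exactly the indirectness being complained about.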

Cue in (the SAS god) Rick Wicklin’s How to simulate multivariate outliers, where he introduces a very clever approach to simulate outlying points with specific Mahalanobis’ distances for the case of standard, bivariate normal data. There are 2 aspects to this method that make it work: (1) the realization that uncorrelated, bivariate standard normal data has circular probability contours and (2) that when the data are uncorrelated, Mahalanobis’ distance reduces down to the regular Euclidean distance. So using the parametric equations $x=r\cos(\theta),\ y=r\sin(\theta)$, where $r$ is the distance between the centroid of the circle $(\mu_{x}, \mu_{y})$ and the outlying points, one can basically set up the coordinates where the outlying points would be. Once a specific correlation matrix is induced on the data, the circle becomes an ellipse and the Euclidean distance becomes Mahalanobis’ distance. Rick goes into this at length in his blog so I won’t repeat it all. But I can show you a quick example. If $\mu_{x}= \mu_{y}=0$, $\theta=\pi/4$ and $r=10$, the standard, bivariate normal data looks like:

where the outlier is on the top in red. The distance from the centre to the outlier is 10 units. Now let’s say we want to induce a correlation of $\rho=0.5$ on this dataset. It would now look like:

And if you ask R to give you the Mahalanobis’ distance for that point, you’ll see it’s 100. Keep in mind that R’s mahalanobis() returns the *squared* distance, so taking the square root recovers the original $r=10$ we were talking about.
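The whole 2-D construction can be sketched in a few lines of R (nothing here is specific to my extension; it is just Rick’s setup, with the sample size and seed chosen arbitrarily for illustration):

```r
# Place an outlier at distance r = 10 from the origin of the uncorrelated
# standard-normal data, then induce rho = 0.5 via the Cholesky factor.
set.seed(42)
r <- 10; theta <- pi / 4
z <- matrix(rnorm(2 * 500), ncol = 2)          # uncorrelated standard normal
z <- rbind(z, r * c(cos(theta), sin(theta)))   # append the outlying point
Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
x <- z %*% chol(Sigma)                         # circle -> ellipse
mahalanobis(x, c(0, 0), Sigma)[501]            # squared distance: 100
```

Because the correlation is induced through the Cholesky factor, the Euclidean distance of the appended point becomes its Mahalanobis’ distance under $\Sigma$, which is why the check returns exactly $r^{2}=100$.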

So far things are looking good… but then Rick mentions the following:

> In two dimensions, you can use the formula (x(t), y(t)) = r*(cos(t), sin(t)) to specify an observation that is r units from the origin. If t is chosen randomly, you can obtain a random point on the circle of radius r. In higher dimensions, it is not easy to parameterize a sphere, which is the surface of a d-dimensional ball. Nevertheless, you can still generate random points on a d-dimensional sphere of radius r by doing the following:

So Rick’s solution, as stated in his blog, only allows us to simulate bivariate normal data where the user can set the Mahalanobis’ distance in 2 dimensions. I mean, he does share a more general approach towards the end, but I couldn’t help but get “hooked on” the whole “it’s not easy to parameterize a sphere”. So I came up with an approach that simply changes the $(x(\theta), y(\theta)) = r(\cos(\theta),\sin(\theta))$ step from the parametric/polar approach to an algebraic one, deciding on the radii needed to get the right Mahalanobis’ distance in multiple dimensions. It’s really just one added step:

(1) Start with $Y \sim \mathcal{N}(\mu, I_{d \times d})$ like Rick does.

(2) Find an $x$ such that the Mahalanobis’ distance between $xY$ and $\mu$ is $r$ as follows:

$r = \sqrt{(xY-\mu)^{t}\Sigma^{-1}(xY-\mu)}$

$r^{2} = x^{2}Y^{t}\Sigma^{-1}Y-2x\mu^{t}\Sigma^{-1}Y+\mu^{t}\Sigma^{-1}\mu$

And solve for $x$ like in any quadratic equation, $x = \frac{-b \pm \sqrt{b^{2}-4ac}}{2a}$. The only difference in this case is that $a=Y^{t}\Sigma^{-1}Y$, $b=-2\mu^{t}\Sigma^{-1}Y$ and $c=\mu^{t}\Sigma^{-1}\mu-r^{2}$.

(3) Once you have found $x$, use $xY$ as the outlying point and proceed with the rest of Rick’s algorithm.
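For the mean-zero case the quadratic collapses nicely: $b=0$, so the scaling step is essentially a one-liner. A minimal sketch of that single step (the seed and direction are arbitrary; this is just step (2) in isolation, not the full algorithm):

```r
# Scale a random direction Y so that x*Y has Mahalanobis distance r.
# With mu = 0 the quadratic a*x^2 + b*x + c = 0 has b = 0, hence
# x = sqrt(-c / a) = r / sqrt(Y' Sigma^{-1} Y).
set.seed(7)
Sigma <- matrix(c(1,  .5, .5,
                  .5, 1,  .5,
                  .5, .5, 1), 3, 3)
r <- 10
Y <- rnorm(3)                            # any nonzero direction will do
a <- drop(t(Y) %*% solve(Sigma) %*% Y)
x <- r / sqrt(a)
mahalanobis(x * Y, rep(0, 3), Sigma)     # squared distance: exactly r^2 = 100
```

Note the scaling is exact regardless of how the direction $Y$ was generated, which is what lets the method hit the requested distance on the nose in any dimension.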

Let’s look at an R example for a simple 3D case. I want a sample of N=500 to come from:

$\mathcal{N} \begin{bmatrix} \begin{pmatrix} 0\\ 0\\ 0 \end{pmatrix}, & \begin{pmatrix} 1 & .5 & .5\\ .5 & 1 & .5\\ .5 & .5 & 1 \end{pmatrix} \end{bmatrix}$

AND I want 3 outlying points, each with equal Mahalanobis’ distance of 10. A 3D scatterplot of this would look like:

And if you were to run the mahalanobis() function in R and look at the final 5 points, you’d get something like (remember it returns *squared* distances, so $10^{2}=100$):

```r
> mahalanobis(x, c(0,0,0), Sigma)[496:500]
[1]   3.367694   6.595042 100.000000 100.000000 100.000000
```


This then allows me to generate as many outliers as I want in as many dimensions as are needed.

Full R code for the function and example:

```r
library(MASS)

sim_outliers <- function(N, mu, Sigma, MD) {
  m <- length(MD)                   # number of outliers
  n <- length(mu)                   # dimension
  mu0 <- 0 * mu                     # work centred at the origin
  Sinv <- solve(Sigma)              # precompute the inverse
  x <- mvrnorm(N - m, mu0, Sigma)   # the non-outlying bulk of the sample

  L <- chol(Sigma)
  Y <- matrix(0, m, n)

  for (k in 1:m) {
    u <- mvrnorm(1, mu0, Sigma)     # a random direction
    u <- t(L) %*% u
    # coefficients of the quadratic a*x^2 + b*x + c = 0
    a <- drop(t(u) %*% Sinv %*% u)
    b <- drop(-2 * t(mu0) %*% Sinv %*% u)
    cc <- drop(t(mu0) %*% Sinv %*% mu0) - MD[k]^2
    root <- (-b + sqrt(b^2 - 4 * a * cc)) / (2 * a)
    Y[k, ] <- root * u              # scaled so the Mahalanobis distance is exactly MD[k]
  }
  # shift the whole sample by the desired mean vector
  sweep(rbind(x, Y), 2, mu, "+")
}
```

```r
### EXAMPLE ###
N <- 500

Sigma <- matrix(c(1,   0.5, 0.5,
                  0.5, 1,   0.5,
                  0.5, 0.5, 1), 3, 3)
mu <- c(X = 0, Y = 0, Z = 0)
MD <- c(10, 10, 10)  ## these are the Mahalanobis' distances

x <- sim_outliers(N, mu, Sigma, MD)
```

```r
> mahalanobis(x, c(0,0,0), Sigma)[496:500]
[1]   3.367694   6.595042 100.000000 100.000000 100.000000

> sqrt(mahalanobis(x, c(0,0,0), Sigma)[496:500])
[1]  1.835128  2.568081 10.000000 10.000000 10.000000
```