This came out of an email exchange that I had with my dear friend Ben Shear and I eventually realized it could benefit more people. If you have two log-normal random variables how can you correlate them *the right way*? When I say *the right way* I mean that you both get the population correlation that you want and the 1-dimensional marginals that you want.

**Naive approach #1 (Wrong): Correlate first, transform later**

This one is a popular one:

`library(MASS)`

`mult_norm<- mvrnorm(1000, c(0,0), matrix(c(1,.5,.5,1),2,2))`

`log_norm<- exp(mult_norm)`

This one gets you the right marginals (they are, indeed, log-normal). But the wrong correlation matrix as you can see:

`cor(mult_norm)`

[,1] [,2]

[1,] 1.0000000 0.4967083

[2,] 0.4967083 1.0000000

> cor(log_norm)

[,1] [,2]

[1,] 1.0000000 0.4419539

[2,] 0.4419539 1.0000000

The resulting correlation matrix is biased downwards. So that won’t do.

**Naive approach #2 (Wrong as well): Transform first, correlate later**

This one is also very appealing because of its simplicity and because it relies on the construction of a bivariate normal distribution. Like, if are each independent and identically distributed define . Then are bivariate normally distributed with and population correlation coefficient .

What would happen then if we did and did ?

In this case, would still be log-normal and . But would NOT be log-normally distributed. The Dufresne (2009) paper is the standard reference for this, but there have been some interesting developments, like this approximation using the log-skew-normal distribution. In any case, the distribution of isn’t known.

**A third (correct) way**

This is essentially a re-hash of Appendix A in Astivia & Zumbo (2017) where it’s not too difficult to show that if are standard bivariate normal with correlation coefficient then the correlation between is:

So what we can do is solve for above and find that:

And what we do is that we go back to Naive Approach #1 and plug **that** as the initial correlation coefficient so that, through the process of exponentiation, we get back the value for that we want our log-normal marginal distributions to have in the population.

Small simulation check:

`library(MASS)`

`###I want a population correlation between`

###the log-normals of 0.5 so:

`rr<-double(1000)`

`for(i in 1:1000){`

mult_norm<-mvrnorm(1000, c(0,0), matrix(c(1,log(.5*(exp(1)-1)+1),log(.5*(exp(1)-1)+1),1),2,2))

`log_norm<-exp(mult_norm)`

`rr[i]<-cor(log_norm)[1,2]`

`}`

mean(rr)

>0.5019529

And the joint distribution would look like this:

We usually call the correlation that we need to set for the initial bivariate normal the “intermediate correlation” and it gives us some interesting insights into how the correlation coefficient behaves when the marginals are not normal.

For example here if I wanted to give my log-normals a correlation I’d do:

` log(-.5*(exp(1)-1)+1)`

[1] -1.959995

And find out that I can’t. Two log-normal random variables generated from a standard bivariate normal distribution CANNOT be correlated at -0.5. The maximum possible negative correlation is actually .

Pingback: Creating Data With Pre-determined Correlations In R - Programming Questions And Solutions Blog