This entry offers a quick tutorial on how to use the code accompanying my article, "Theoretical considerations when simulating data from the g-and-h family of distributions", co-authored with Dr. Edward Kroc.
The code works in two steps that need to be run independently, both to save time and to make the simulation process more efficient. It implements the theory outlined in Kowalchuk & Headrick (2010) with additional modifications described in my article, the most important one being the use of homotopy continuation to estimate the multiple roots of the system of transcendental equations that describe the skewness and (excess) kurtosis of the g-and-h family of distributions. Let us proceed with an example.
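For reference (this is the standard definition of the family, not something you need from the accompanying code): a g-and-h random variable is just a transformed standard normal, with g controlling skewness and h controlling tail heaviness. A minimal sketch of the transformation, using my own helper name gh_transform():

gh_transform <- function(z, g, h) {
  # q(z) = ((exp(g*z) - 1)/g) * exp(h*z^2/2), applied to z ~ N(0, 1)
  if (g == 0) {
    z * exp(h * z^2 / 2)                       # limiting case: the pure-h distribution
  } else {
    ((exp(g * z) - 1) / g) * exp(h * z^2 / 2)
  }
}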
Let's pretend for a moment that you would like to simulate a trivariate g-and-h random variable. You would like each 1-D marginal distribution to have a skewness of 2 and an excess kurtosis of 7. You would also like them to be equally correlated at 0.5.
The first step is finding the values of the (g, h) parameters that will give you the skewness and excess kurtosis of your choice. To do that, you run the gh_roots() function, which takes as input: (1) a vector of skewness parameters (one for each 1-D marginal); (2) a vector of excess kurtosis parameters; (3) the number of interval divisions of [0,1] used to execute the homotopy continuation (the argument m, with a default of 10); and (4) the number of attempted solutions (tries, with a default of 100). I have found that these defaults work well, but if you would like to find more than one (g, h) pair for the same values of skewness and excess kurtosis, you need to increase both m and the number of random attempts. And yes, the more you increase these numbers, the longer it takes R to find all the solutions. Depending on the values of skewness and kurtosis, this could run for anywhere from several minutes (say, 5-10 minutes) to a full hour or more. This example is easy, so it'll only take a few minutes. So one would do something like:
> gh_roots(skew=2, kurt=7, m=10, tries=100)
[1] "3 solution(s) found for a skewness of 2 and a kurtosis of 7"
$g
[,1] [,2] [,3]
Var1 .4613686 .3998958 6.749384
$h
[,1] [,2] [,3]
Var1 .02466843 .0464031 -78.60594
You'll get some other numbers that don't really matter, as well as a nice arrangement at the end that pairs each g parameter with its corresponding h parameter. For my purposes, let's say I want to use the first pair, so that g = .4613686 and h = .02466843.
Now that you've chosen the parameters that will give you the population skewness and excess kurtosis of your choice, the next step is figuring out what the intermediate correlation matrix should be. If the concept of the 'intermediate correlation matrix' is new to you, please read this blog entry of mine.
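To see why an intermediate correlation is needed at all, here is a quick sketch of my own (it assumes the MASS package for mvrnorm() and reuses the first (g, h) pair from above): pushing correlated normals through the g-and-h transformation changes their Pearson correlation, so the correlation you give the underlying normals is generally not the correlation you end up with.

library(MASS)                                  # for mvrnorm()
set.seed(456)
g <- .4613686; h <- .02466843
Z <- mvrnorm(1e5, mu = c(0, 0), Sigma = matrix(c(1, .5, .5, 1), 2, 2))
Y <- ((exp(g * Z) - 1) / g) * exp(h * Z^2 / 2) # g-and-h transform, applied elementwise
cor(Z)[1, 2]                                   # approximately 0.5, by construction
cor(Y)[1, 2]                                   # no longer 0.5 after the transformation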
To actually simulate the trivariate g-and-h distribution, you'll now use the multgh() function. This one takes as arguments: (1) the sample size N (default set at 30); (2) a vector of g parameters; (3) a vector of h parameters; and (4) a population correlation matrix Sigma of your choice. Remember, Sigma needs to be the correlation matrix that *you*, the user, want as a population parameter. multgh() will use the theoretical framework described in Kowalchuk & Headrick (2010) to calculate the intermediate correlation and actually simulate the data with g-and-h distributed marginals.
So, for our example above we can do:
set.seed(123)
n <- 10000                                      # sample size
g1 <- c(.4613686, .4613686, .4613686)           # g parameter for each of the 3 marginals
h1 <- c(.02466843, .02466843, .02466843)        # h parameter for each of the 3 marginals
Sigm <- matrix(c(1, .5, .5, .5, 1, .5, .5, .5, 1), 3, 3)  # target population correlation matrix
dat <- multgh(N=n, g=g1, h=h1, Sigma=Sigm)$data           # simulated data with g-and-h marginals
And if you do:
> cor(dat)
[,1] [,2] [,3]
[1,] 1.0000000 0.5039843 0.5211398
[2,] 0.5039843 1.0000000 0.4989272
[3,] 0.5211398 0.4989272 1.0000000
And if you use the describe() function from the psych package, you can see that:
> describe(dat)
vars n mean sd median trimmed mad min max range skew kurtosis
X1 1 10000 0.25 1.23 -0.01 0.1 0.97 -2.16 12.75 14.91 1.85 7.01
X2 2 10000 0.24 1.24 -0.01 0.1 0.99 -2.03 16.85 18.88 1.93 9.15
X3 3 10000 0.25 1.24 -0.02 0.1 0.98 -2.34 12.68 15.03 1.73 6.18
So the skewness and (excess) kurtosis match the intended population values. If you're new to the world of simulation, keep in mind that higher-order moments (skewness, kurtosis and whatnot) require REALLY large samples to be estimated with any precision. We're talking hundreds of thousands or millions of observations, and the more extreme the target skewness and kurtosis, the larger the sample needed. So, for my purposes, I know this works well.
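If you want to convince yourself, here is a small sketch of my own (not part of the accompanying code) that tracks the sample skewness and excess kurtosis of a single g-and-h marginal as the sample size grows:

set.seed(789)
g <- .4613686; h <- .02466843
for (n in c(1e3, 1e4, 1e5, 1e6)) {
  z <- rnorm(n)
  y <- ((exp(g * z) - 1) / g) * exp(h * z^2 / 2)  # one g-and-h marginal
  m <- mean(y); s <- sd(y)
  cat(sprintf("n = %7.0f  skew = %5.2f  excess kurtosis = %6.2f\n",
              n, mean((y - m)^3) / s^3, mean((y - m)^4) / s^4 - 3))
}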