When you first click on the app it looks like this:

What *you*, the user, needs to provide it with is the following:

The **number of predictors**. It can handle anything from 3 to 6 predictors. With more than that, the app simply becomes too crowded. The default is 3 predictors.

The **regression coefficients** (i.e., the standardized effect sizes) that hold in the population. The default is 0.3. The app names them “x1, x2, x3, … x6”.

The **skewness and excess kurtosis** of the data for each predictor AND for the dependent variable (the app calls it “y”). Please keep on reading to see how you should choose those. The defaults at this point are a skewness of 2 and an excess kurtosis of 7.

The **pairwise correlations among the predictors**. I think this is quite important because the correlation among the predictors plays a role in calculating the standard error of the regression coefficients. So you can either be VERY optimistic and place those at 0 (predictors are orthogonal to one another) OR you can be very pessimistic and give them a high correlation (multicollinearity). The default inter-predictor correlation is 0.5.

The **sample size**. The default is 200.

The **number of replications** for the simulation. The default is 100.

Now, *what’s the deal with the skewness and excess kurtosis?* A lot of people do not know this, but you cannot go around choosing values of skewness and excess kurtosis all willy-nilly. There is a quadratic relationship between the possible values of skewness and excess kurtosis which specifies that they MUST be chosen according to the inequality kurtosis > skewness^2 - 2. If you don’t respect it, the app will spit out an error. Now, I am **not** a super fan of the algorithm needed to generate data with those population-specified values of skewness and excess kurtosis, because for many combinations that satisfy the theoretical inequality it still fails.

Now, the exact boundaries of what this method can calculate are actually smaller than the theoretical parabola. However, for practical purposes, as long as you choose values of kurtosis sufficiently far above the square of the skewness, you should be fine. So a combo like skewness = 3, kurtosis = 7 would give it trouble, but something like skewness = 3, kurtosis = 15 would be perfectly fine.
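If you want to check a candidate pair before feeding it to the app, a tiny helper along these lines works (this is my own sketch, not code from the app; the `margin` buffer is an ad-hoc choice to stay away from the boundary):

```r
# Check whether a (skewness, excess kurtosis) pair respects the theoretical
# boundary kurtosis > skewness^2 - 2, with an ad-hoc safety margin because
# the practical boundary sits above the theoretical one.
check_moments <- function(skew, kurt, margin = 1) {
  lower <- skew^2 - 2
  if (kurt <= lower) {
    "invalid: violates kurtosis > skewness^2 - 2"
  } else if (kurt <= lower + margin) {
    "risky: too close to the theoretical boundary"
  } else {
    "ok"
  }
}

check_moments(3, 7)   # on the boundary -> invalid
check_moments(3, 15)  # comfortably above it -> ok
check_moments(2, 7)   # the app's defaults -> ok
```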

A hypothetical run would look like this:

So the output under **Results** is the empirical, simulated power for each regression coefficient at the sample size selected. In this case, they gravitate around 60%.

Oh! And if for whatever reason you would like to have all your predictors be normal, you can set the values of skewness and kurtosis to 0. In fact, in that situation you would end up working with a multivariate normal distribution.
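As a quick illustration of that special case (my own sketch using the `MASS` package, not whatever the app uses internally), three all-normal predictors with the default inter-predictor correlation of 0.5 can be generated like so:

```r
library(MASS)  # for mvrnorm()

set.seed(123)
Sigma <- matrix(0.5, 3, 3)  # inter-predictor correlations of 0.5
diag(Sigma) <- 1
X <- mvrnorm(200, mu = rep(0, 3), Sigma = Sigma)  # n = 200, the app's default
round(cor(X), 2)  # sample correlations hover around 0.5
```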

This post is about how the `pos_def_limits()` function from the `faux` R package works. This package is being developed by the super awesome Dr. Lisa DeBruine who you should, like, totally follow, btw, in case you are not doing it already. What makes this post interesting is that I thought `pos_def_limits()` was doing one thing but it is doing something else. I think it helps highlight how different ways of approaching the same problem can give you insight into different aspects of it. But first, some preliminaries:
**What is positive definiteness?**

This one is not particularly complicated to point out. Define *A* as an *n* × *n* real-valued matrix and *x* as an *n*-dimensional real-valued, non-zero vector. Then *A* is a positive-definite matrix if *x*ᵀ*Ax* > 0, and it is positive-**semi**-definite if *x*ᵀ*Ax* ≥ 0, for all *x*. I prefer this definition of positive definiteness because it generalizes easily to other types of linear operators (e.g., differentiation), as opposed to *consequences* of this definition, which is what we usually operate on. If you come from the social sciences (like I do), the “version” that you know about a matrix (usually, a covariance matrix) being positive-definite is that all its eigenvalues have to be positive. Which is what `pos_def_limits()` relies on. It implements a grid search over the plausible correlation range of [-1, +1] and, once it finds the minimum and maximum values for which the resulting covariance matrices are all positive definite, it produces a result (or it lets you know that said matrix does not exist, which is super useful as well). Relying on the documentation example:

```r
pos_def_limits(.8, .2, NA)
#    min   max
# -0.427 0.747
```

which means that if you have a correlation matrix that looks like:

```
  1   .8   .2
 .8    1    r
 .2    r    1
```

then as long as -0.427 < r < 0.747, your resulting matrix is positive definite and, hence, a valid correlation matrix. So far so good.
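You can sanity-check those bounds yourself by plugging candidate values of r into the matrix and looking at the smallest eigenvalue (a quick sketch of the same idea `pos_def_limits()` relies on):

```r
# Smallest eigenvalue of the example matrix as a function of r:
# positive inside the valid range, negative outside of it.
min_eig <- function(r) {
  R <- matrix(c(1, .8, .2,
                .8, 1, r,
                .2, r, 1), 3, 3)
  min(eigen(R, only.values = TRUE)$values)
}

min_eig(0)    # inside the range: positive
min_eig(0.9)  # outside the range: negative
```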

**How I tackle this problem**

This is what I thought `pos_def_limits()` was doing under the hood before looking at the source code. So… a condition similar to the positive eigenvalues is that the determinant of the matrix has to be non-negative: **IF** R is positive-semi-definite, **THEN** det(R) ≥ 0. Notice that this DOES NOT work the other way around: just because det(R) ≥ 0 does not mean that R is positive-semi-definite. Anyway, we can rely on the fact that all valid correlation/covariance matrices are positive-(semi-)definite by definition, which means the problem of finding the suitable upper and lower bounds is simply solving for r subject to the constraint that the determinant MUST be greater than or equal to zero. So with the help of a CAS (Computer Algebra System; my favourite one is MAPLE because I’m very Canadian, LOL) I can see that solving for det(R) results in the following:

det(R) = -r^2 + 0.32r + 0.32

which is… A QUADRATIC EQUATION! We can graph it and see:

So any value on the x-axis between both roots yields a valid inequality and, therefore, a positive-definite matrix R. Do you remember how to solve for the roots of quadratic equations? Using our trusted formula from high school we obtain:

r = (0.32 ± √(0.32² + 4(0.32))) / 2 ≈ -0.427 or 0.747

which match the values approximated by `pos_def_limits()`.
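If the algebra isn’t your thing, base R’s `polyroot()` gets there too (coefficients go in increasing order of degree):

```r
# Roots of det(R) = -r^2 + 0.32 r + 0.32, i.e. the bounds on r.
roots <- sort(Re(polyroot(c(0.32, 0.32, -1))))
round(roots, 3)  # -0.428 and 0.748
```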

**Extensions to this problem Pt I: Maximizing the determinant**

There are 2 interesting things you can do with this determinantal equation. First and foremost, you can choose *the* value of r that maximizes the variance encoded in the correlation matrix R. The determinant has a lot of interesting properties, including a very nice geometric representation: the absolute value of the determinant is the volume of the parallelepiped described by the column vectors of the matrix, and this generalizes to higher dimensions. Because correlations are bounded in the [-1, +1] range, the maximum determinant that ANY correlation matrix can have is 1 and the minimum is 0. So if I want my matrix to have the largest possible determinant, I only need to choose the value of r at the vertex of the parabola:

which has coordinates (0.16, 0.3456). So if r = 0.16 then the determinant of R is at its maximum.
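Numerically, `optimize()` over the valid range recovers the same vertex (a one-line sketch):

```r
# Maximize det(R) over the positive-definite range of r.
det_R <- function(r) det(matrix(c(1, .8, .2, .8, 1, r, .2, r, 1), 3, 3))
opt <- optimize(det_R, interval = c(-0.427, 0.747), maximum = TRUE)
opt$maximum    # ~ 0.16
opt$objective  # ~ 0.3456
```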

**Extensions to this problem Pt II: What if more than one correlation is missing?**

The “classical” version of this problem is to have a 3 × 3 matrix where 2 correlations are known and 1 is missing. But (as my student questioned further) what would happen if we had, say, a 4 × 4 matrix where TWO correlations were missing? Well, no biggie. Let’s come up with a new matrix with the following form:

Yeah, I had to make up a couple of extra correlations (0.3 and 0.5) to make sure only TWO correlations were missing. But there is a point to this that you will notice very quickly. Since we still want this to be a valid correlation matrix, the condition that the determinant be non-negative must still hold, irrespective of the dimensions of the matrix or how many elements are missing. So, once again, running this new matrix through the CAS system yields the condition:

which generates a system of quadratic inequalities that must be solved simultaneously to yield valid ranges. Rather than showing you the equations (booooooring! :D), let me show you something prettier: a picture of the solution space:

Yup. Any combination of points (a, r) within the red regions would yield a valid solution to the inequality above. HOWEVER, there is only ONE region where the solution is both within the valid (i.e., red) regions AND gives us solutions in the valid correlation range. What does your intuition tell you? I think we both know where to look.

That is correct! That little blob-looking thingy is where we want to be. ANY combination of values inside the blob is both within [-1, +1] AND satisfies the determinantal equation, so any pair of values STRICTLY inside the blob will yield a valid correlation matrix.

“But Oscar,” you may ask, “how do we choose the one pair of values that maximizes the determinant?” Well, that’s a good question! It is not difficult to answer, but it does require a little more mathematics than the case where only one correlation is missing. What we need is to, first, take the partial derivatives with respect to both r and a and set them to 0 (we need to find maxima):

So now we have a system of quadratic equations: two equations in two unknowns. Again, when you throw them into MAPLE you end up with multiple pairs of values for a and r that satisfy it. There was only one pair of solutions, though, which was both within the valid range [-1, +1] AND inside the special blob:

The last thing we need to check, however, is whether these points are minima, maxima or saddle points, as per the 2nd derivative test.

Which means I need all the 2nd partial derivatives from the equation above:

Calculate the discriminant, setting a = 0.473863 and r = -0.038687:

Which is greater than 0, so (a, r) is either a minimum or a maximum. The final check is to see whether the second partial derivative is negative, to make sure it’s a local maximum. And since it is, it follows that the point (a, r) indeed maximizes the determinant.

**TWO potential future directions**

While working on this I noticed a couple of peculiarities that I think are sufficiently mathematically tractable for me to handle and turn into an actual article, UNLESS you (my dear reader) already know the answer. I am, after all, a lazy !@#$#%, which means that if someone has already worked out a formal proof, I’d much rather read it than have to come up with it on my own. Let us start with the easy one:

**(1) (Easy): The range of plausible correlations shrinks as the dimensions of the correlation matrix increase**

This one is, I think, easy to see. Let’s start with the basic 2×2 correlation matrix:

Then if r ∈ (-1, +1) (read “∈” as “element of”), R is a valid correlation matrix, and it only becomes positive-**semi**-definite if either r = -1 or r = +1. But you get the gist: the whole valid correlation range applies.

Notice how the range of r shrank for the 3×3 matrix of our example to [-0.427, 0.747].

And this fact is independent of which particular correlation matrix you have: you cannot find a 3×3 correlation matrix where, given that 2 of the correlations are known, the resulting range of the missing r is the full valid interval AND the matrix is still positive definite.

So, what happens if we go back to our 4 × 4 example with two missing correlations? Just for kicks and giggles, let’s make a = 0.3 so that we are only left with 1 unknown. The new matrix looks like this:

If we run this new matrix through `pos_def_limits()` (it can handle any number of dimensions) we get:

```r
pos_def_limits(.8, .2, .3, NA, .5, .3)
#    min   max
# -0.277 0.734
```

Yup, the range has now shrunk. But we don’t quite know why yet. Let’s try it now through the determinantal equation:

det = -0.91r^2 + 0.416r + 0.1856 ≥ 0

And solving for the roots of this equation we get:

r ≈ -0.277 and r ≈ 0.734

Yup, same answer. But notice something interesting: the quadratic coefficient has now shrunk in absolute value from 1 to 0.91. And, if you remember back from high school, the coefficient of the quadratic term dictates how wide or narrow the parabola is; as its absolute value shrinks toward 0, the parabola flattens out into a straight line. Which prompts me to make the following claim:

*Claim: As the dimensions of the correlation matrix increase arbitrarily, the valid range that makes it positive definite shrinks until it collapses to a single point. In other words, for a large enough correlation matrix, only ONE value can make it positive definite.*

This one shouldn’t be particularly difficult to prove. All I need to show is that the leading coefficient of the quadratic term shrinks as a function of the dimensions of the correlation matrix until it becomes 0, in which case you’d have a straight line (not a parabola). And that means you’d get only 1 root (and not 2 roots that encompass a range).
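Both ranges above can be reproduced with a small eigenvalue grid search (my own sketch re-implementing the `pos_def_limits()` idea; the 4 × 4 layout below is my reading of the example, with the missing correlation in the [2,3] position and the made-up 0.5 and 0.3 values filling out the last column):

```r
# Grid-search the positive-definite range of a single missing correlation r.
pd_range <- function(build) {
  rs <- seq(-0.999, 0.999, by = 0.001)
  ok <- sapply(rs, function(r)
    min(eigen(build(r), only.values = TRUE)$values) > 0)
  range(rs[ok])
}

R3 <- function(r) matrix(c(1, .8, .2,
                           .8, 1, r,
                           .2, r, 1), 3, 3)
R4 <- function(r) matrix(c(1, .8, .2, .3,
                           .8, 1, r, .5,
                           .2, r, 1, .3,
                           .3, .5, .3, 1), 4, 4)

pd_range(R3)  # approx. -0.427 to 0.747
pd_range(R4)  # approx. -0.277 to 0.734: a narrower range
```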

**(2) (Hard): If I sample values of r uniformly from the valid correlation range, the distribution of the determinants concentrates around *the* value of r that maximizes the determinant.**

This one is a little bit more difficult to explain, but let me show you an interesting thing I found. Let’s use the classic 3 × 3 matrix case and only focus on r. We know from above that if r = 0.16 then the value of det(R) is maximized. Now, let’s use R (the programming language, not the matrix) to uniformly sample random values of r, calculate the determinants and plot them:

```r
library(ggplot2)

n <- 10000
r <- runif(n, min = -.427, max = .747)
pp <- numeric(n)  # pre-allocate the vector of determinants

for (i in 1:n) {
  R <- matrix(c(1, .8, .2, .8, 1, r[i], .2, r[i], 1), 3, 3)
  pp[i] <- det(R)
}

dat <- data.frame(pp)
a <- density(pp)
mmod <- a$x[a$y == max(a$y)]  # mode of the empirical distribution

p <- ggplot(dat, aes(x = pp)) +
  geom_density(fill = "lightgreen", alpha = .4, size = 1)
p + geom_vline(aes(xintercept = mmod), color = "red",
               linetype = "dashed", size = 1) +
  theme_bw() + xlab("Determinant") + ylab("")
```

Compare the mode of the distribution to the theoretical, maximum possible determinant:

```r
> R1 <- matrix(c(1, .8, .2, .8, 1, .16, .2, .16, 1), 3, 3)
> det(R1)
[1] 0.3456
> mmod  ## this is the mode of the distribution above
[1] 0.3344818
```

Close, within 0.011 error. Which leads me to make the following claim:

*Claim: For the “missing correlation” problem, the distribution of the determinants of the correlation matrix concentrates around the value of r that maximizes it.*

I honestly have no clue how either of these two results would be useful once formalized. I am sensing that something like this may be able to play a role in error detection or in diagnosing Heywood cases in Factor Analysis. I mean, for the first case (error detection), say you find a correlation matrix in the published literature that is not positive definite. If correlation matrices tend to concentrate around values that maximize their determinants, then you could potentially use this framework to pick and choose ranges of possible sets of correlations to point out where the problem may lie. A similar logic could be used for Heywood cases, ESPECIALLY if the dimensions of the correlation matrix are large. Or maybe all of this is BS and a very elaborate excuse for me to procrastinate. The world will never know ¯\_(ツ)_/¯

It is very straightforward to use. When you click on the app, it will look like this:

What *you*, the user, needs to provide it with is the following:

The **population correlation** (i.e., the effect size) for which you would like to obtain power. The default is 0.3.

The **type of distributions** you would like to correlate together. Right now it can handle the chi-square distribution (where skewness is controlled through the degrees of freedom), the uniform distribution (to have a symmetric distribution with negative kurtosis) and the binomial distribution, where one can control the number of response categories (size) and the probability parameter. This will soon be replaced by a multinomial distribution so that the probability of every marginal response option can be specified.

The **sample size**. The default is 20.

The **number of replications** for the simulation. The default is 100.

Now, what is the app actually doing? It runs R underneath, and it is going to give you the estimated power of the test under two conditions. The first one is calculated under the assumption that your data are bivariate normally distributed. This is just a direct use of the `pwr.r.test` function of the `pwr` R package, and it should give you answers very close to (or exactly the same as) G*Power. I chose to use it for the sake of comparison.

What comes out is something that looks like this:

So, on top, you’re going to get the power as if you were using G*Power (or the `pwr` R package), which is exact and needs no simulation because we have closed-form expressions for it. On the bottom, you are going to get the approximate power obtained through simulation. Remember, when working with non-normal data you can’t always expect power to be lower, as in this case: sometimes it may be higher and sometimes it won’t change much. Yes, at some point (if your sample size is large enough) the distribution of your data will matter very little. But, in the meantime, at least you can use this one to guide you!
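For the curious, the simulation side can be sketched in a few lines of base R (this is my own minimal version, not the app’s actual code; note that transforming the marginals means the population Pearson correlation is no longer exactly 0.3):

```r
# Simulated power for testing a correlation of ~0.3 with n = 20,
# using skewed (chi-square, df = 1) marginals built from a normal copula.
set.seed(7)
rho <- 0.3; n <- 20; reps <- 2000
hits <- replicate(reps, {
  z1 <- rnorm(n)
  z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)  # correlated normals
  x <- qchisq(pnorm(z1), df = 1)               # skewed marginals
  y <- qchisq(pnorm(z2), df = 1)
  cor.test(x, y)$p.value < 0.05
})
mean(hits)  # proportion of significant results = simulated power
```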

Finally, here’s the link for the shiny web app:

If you work with latent variable models, Factor Analysis, Structural Equation Modelling, Item Response Theory, etc., there’s a good chance that you have either encountered or seen some version of a warning about a covariance matrix being “non positive definite”. This is an important warning because the software is telling you that your covariance matrix is not a valid covariance matrix and, therefore, your analysis is suspect. Usually, within the world of latent variables, we call these types of warnings Heywood cases.

Now, most Heywood cases are very easy to spot because they pertain to one of two broad classes: negative variances or correlations greater than 1. When you inspect your model matrices and see either of those two cases, you know exactly which variable is giving you trouble. Thing is (as I found out a few years ago), there are other types of Heywood cases that are a lot more difficult to diagnose. Consider the following matrix that I once obtained while helping a student with his analysis:

```
       space lstnng actvts prntst persnl intrct prgrmm
space  1.000
lstnng 0.599 1.000
actvts 0.706 0.646  1.000
prntst 0.702 0.459  0.653  1.000
persnl 0.591 0.582  0.844  0.776  1.000
intrct 0.627 0.964  0.501  0.325  0.639  1.000
prgrmm 0.493 0.602  0.981  0.687  0.944  0.642  1.000
```

This is the model-implied correlation matrix I obtained through the analysis, which gave a Heywood case warning. The student was a bit puzzled because, although we had reviewed this type of situation in class, we had only mentioned the cases of negative variances or correlations greater than one. Here he had neither a negative variance nor a correlation greater than one… but he still got a warning for non-positive definiteness.

My first reaction was, obviously, to check the eigenvalues of the matrix and, lo and behold, there it was:

```
[1]  5.01377877  1.00744933  0.62602056  0.30393170  0.16671742  0.01317704 -0.13107483
```

So… yeah. This was, indeed, an invalid correlation/covariance matrix and we needed to further diagnose where the problem was coming from… but how? Enter our good friend, linear algebra.

If you have ever taken a class in linear algebra beyond what’s required in a traditional methodology/statistics course sequence for the social sciences, you may have encountered something called the minor of a matrix. Minors are important because they’re used to calculate the determinant of a matrix. They’re also important for cases like this one because they break down the structure of a matrix into simpler components that can be analyzed. Way in the back of my mind I remembered from my undergraduate years that positive-definite matrices have something special about their minors. So when I came home I went through my old, OLD notes and found this beautiful theorem, known as Sylvester’s criterion:

*A Hermitian matrix is positive-definite if and only if all of its leading principal minors are positive.*

All covariance matrices are Hermitian (the subject for another blogpost), so we’re only left to wonder what a leading principal minor is. Well, imagine starting at the [1,1] coordinate of a matrix (so the upper-left entry) and moving downwards diagonally, expanding one row and column at a time: the determinants of the square submatrices you trace out are the leading principal minors. A picture makes a lot more sense of this:

So… yeah. The red square (so the q11 entry) gives the first principal minor. The blue square (a 2 × 2 matrix) gives the 2nd principal minor, the yellow square (a 3 × 3 matrix) gives the 3rd, and on it goes until you get the full n × n matrix. For the matrix Q to be positive-definite, all n principal minors need to be positive. So if you want to “diagnose” your matrix for positive definiteness, all you need to do is start from the upper-left corner and check the determinants consecutively until you find one that is not positive. Let’s do that with the previous example. Notice that I’m calling the matrix ‘Q’:

```r
> det(Q[1:2, 1:2])
[1] 0.641199
> det(Q[1:3, 1:3])
[1] 0.2933427
> det(Q[1:4, 1:4])
[1] 0.01973229
> det(Q[1:5, 1:5])
[1] 0.003930676
> det(Q[1:6, 1:6])
[1] -0.003353769
```

The first [1,1] entry is a given (it’s a correlation matrix, so it’s 1 and it’s positive). Then we move down to the 2×2 matrix, the 3×3 matrix… all the way to the 5×5 matrix. By then the determinant of Q is very small, so I suspected that, whatever the issue might be, it had to do with the relationships among the variables “persnl”, “intrct” and “prgrmm”. The final determinant pointed towards the culprit: whatever problem this matrix exhibited, it had to do with the relationship between “intrct” and “prgrmm”.
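The stepwise check is easy to wrap into a little helper (a sketch; by Sylvester’s criterion, the matrix is positive-definite iff every one of these determinants is positive):

```r
# Determinants of all leading principal submatrices of M.
leading_minors <- function(M) {
  sapply(seq_len(nrow(M)), function(k) det(M[1:k, 1:k, drop = FALSE]))
}

# A valid 2 x 2 correlation matrix: all minors positive.
R_ok <- matrix(c(1, .5, .5, 1), 2, 2)
leading_minors(R_ok)   # 1.00 0.75

# An invalid 3 x 3 matrix: no entry exceeds 1, yet the last minor is negative.
R_bad <- matrix(c(1, .9, .1,
                  .9, 1, .9,
                  .1, .9, 1), 3, 3)
leading_minors(R_bad)  # 1.000 0.190 -0.468
```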

Once I pointed out to the student that I suspected the problem was coming from either one of these two variables, a careful examination revealed the cause: the “intrct” item was a reverse-coded item but, for some reason, several of the participants’ responses were not reverse-coded. So you had a good chunk of the responses to this item pointing in one direction and a smaller (albeit still large) number pointing in the other direction. The moment this item was fully reverse-coded, the non-positive definite issue disappeared.

I guess there are two lessons to this story: (1) rely on the natural structure of things to diagnose problems, and (2) learn lots of linear algebra.

The theory of what I am going to talk about is developed in said article, but when I was cleaning more of my computer files I found an interesting example that didn’t make it in there. Here’s the gist of it:

I really don’t get why people say that the Spearman correlation is the ‘robust’ version of, or alternative to, the Pearson correlation. Heck, even if you simply google the words “Spearman correlation”, the second top hit reads:

* Spearman’s rank-order correlation is the nonparametric version of the Pearson product-moment correlation*

When I read that, my mathematical mind immediately goes to “if this is the non-parametric version of the Pearson correlation, that means it also estimates the same population parameter”. And honest to G-d (don’t quote me on that one, though, but I just *know* it’s true) I feel like the VAST MAJORITY of people think exactly that about the Spearman correlation. And I wouldn’t blame them either… you can’t open an intro textbook for social scientists that doesn’t have some dubious version of the previous statement. “Well” – the reader might think – “if that is not true, then why aren’t more people saying it?” The answer is that, for better or worse, this is one of those questions that is very simple to pose but whose answer is mathematically complicated. But here’s the gist of it (again, for those who like theory as much as I do, read the article).

The Spearman rank correlation is defined, in the population, like this:

ρ_S = 12 ∫∫ C(u₁, u₂) du₁ du₂ − 3

where the u’s are uniformly-distributed random variables and C is the copula function that relates them (more on copulas here.) By defining the Spearman rank correlation in terms of the lower-dimensional marginals it co-relates (i.e., the u’s) and the copula function, it becomes apparent that the overlap with the Pearson correlation depends entirely on what C is. Actually, it is not hard to show that if C is a Gaussian copula and the marginals are normal, then the following identity can be derived:

ρ_S = (6/π) arcsin(ρ/2)

which relates the Pearson correlation ρ to the Spearman correlation ρ_S. We’ve known this since the times of Pearson, because he came up with it (albeit not explicitly) and called it the “grade correlation”. But from this follows the obvious: an identity such as the one described above need not exist. There could be copula functions for which the Spearman and Pearson correlation have a crazy, wacky relationship… and that is what I am going to show you today.
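That grade-correlation identity is easy to check by simulation (a quick sketch using `MASS::mvrnorm`; with normal marginals and a Gaussian copula, the empirical Spearman correlation should land on (6/π)·arcsin(ρ/2)):

```r
library(MASS)  # for mvrnorm()

set.seed(1)
rho <- 0.5
XY <- mvrnorm(2e5, mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), 2, 2))

cor(XY[, 1], XY[, 2], method = "spearman")  # empirical
(6 / pi) * asin(rho / 2)                    # theoretical: ~0.4826
```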

I don’t remember why I didn’t include this example in the article, but it is a very efficient one: it shows a case where the Spearman correlation is close to 1 but the Pearson correlation is close to 0 (within sampling error, obviously).

Define Z ~ Normal(mean = 0, sd = 0.1), X = Z^201 and Y = exp(Z). If you use R to simulate a very large sample (say, 1 million) then you can find the following Spearman and Pearson correlations:

```r
N <- 1000000
Z <- rnorm(N, mean = 0, sd = 0.1)
X <- Z^201
Y <- exp(Z)

> cor(X, Y, method = "pearson")
[1] 0.004009381
> cor(X, Y, method = "spearman")
[1] 0.9963492
```

The trick for this example is to notice that the rate of change of X = Z^201 is microscopic and oscillating around 0, whereas the change in Y = exp(Z) is fairly small and oscillating around 1. Both, however, are increasing functions of Z, so the ranks of X and Y line up almost perfectly, which is why the Spearman correlation is close to 1 even though the Pearson correlation is close to 0.

In any case, the point is that, even without being too formal, it’s not overly difficult to see that the Spearman correlation is its own statistic and estimates its own population parameter, one that may or may not have anything to do with the Pearson correlation, depending on the copula function describing the bivariate distribution.


WARNING (1): This app can take a little while to run. Do not close your web browser unless it gives you an error. If it appears ‘stuck’ but you haven’t got an error, it means the simulation is still running in the background.

WARNING (2): If you keep getting a ‘disconnected from server’ error, close down your browser and open a new window. If the problem still persists that means too many people have tried to access it during the day and the server has shut down. This app is hosted on a free server and it can only accommodate a certain number of people every day.

This app will perform computer simulations to estimate power for multilevel logistic regression models, allowing for continuous or categorical covariates/predictors and their interactions. The continuous predictors come in two types: normally distributed or skewed (i.e., χ² with 1 degree of freedom). It currently only supports binary categorical covariates/predictors (i.e., Bernoulli-distributed), *but* with the option to manipulate the probability parameter *p* to simulate imbalance between the groups.

The app will give you the power for each individual covariate/predictor AND the variance component for the intercept (if you choose to fit a random-intercept model) or the slope (if you choose to fit a model with both a random intercept and a random slope). It uses the Wald test statistic for the fixed effect predictors and a 1-degree-of-freedom likelihood-ratio test for the random effects (← yes, I know this is conservative but it’s the fastest one to implement).

When you open the app, here’s how it looks:

What **you**, as the user, need to provide is the following:

The** Level 1 and Level 2** sample sizes. If I were to use the ubiquitous example of “children in schools” the Level 1 sample would be the children (individuals within a cluster) and the Level 2 sample would be the schools (number of clusters). For demonstration purposes here I’m asking for groups of 50 ‘children’ in 10 ‘schools’ for a total sample size of 50×10 = 500 children.

The **variance for the random effects.** You can either choose to fit an intercept-only model (so no variance for the slope) or a random-intercept AND random-slope model. You *cannot* fit a random-slope-only model here, and you *cannot* set the variances to 0 to fit a single-level logistic regression (there’s other software to do power analysis for single-level logistic regression). At least the variance of the intercept needs to be specified. Notice that the app defaults to an intercept-only model, so under ‘Select Covariate’ it will say ‘None’. That changes when you click on the drop-down menu, where it lets you choose which predictor gets the random slope. Notice that you can only choose *one* predictor to have a random slope; I will work on the general case in the future.

The **number of covariates** (or predictors) which I believe is pretty self-explanatory. Just notice that the more covariates you add, the longer it will take for the simulation to run. The default in the app is 2 covariates.

This would be the core of the simulation engine because the user needs to specify:

- **Regression coefficients (‘Beta’).** This space lets the user specify the effect size for the regression coefficients under investigation. The default is 0.5, but that can be changed to any number. In the absence of any outside guidance, Cohen’s small-medium-large effect sizes are recommended. Remember that the regression coefficient for binary predictors is conceptualized as a standardized mean difference, so it should be in Cohen’s *d* metric.

- **Level of the predictor (‘Level’).** It only supports 2-level models, so the options are ‘1’ or ‘2’. This section indicates whether a predictor belongs to the Level 1 sample (e.g. the ‘children’) or the Level 2 sample (e.g. the ‘schools’). Notice that whichever predictor gets assigned a random slope MUST also be selected as Level 1; otherwise the power analysis results will not make sense. It currently only supports one predictor at Level 1 with a random slope. Other predictors can be included at Level 1, but they won’t have the option for a random slope component.

- **Distribution of the covariates (‘Distribution’).** Offers 3 options: normally-distributed, skewed (i.e. χ² with 1 degree of freedom, or a skew of about √8) and binary/Bernoulli-distributed. For the binary predictor, the user can change the population parameter *p* and create imbalance between the groups. So, for instance, if *p* = 0.3 then 30% of the sample would belong to the group labelled ‘1’ and 70% to the group labelled ‘0’. The default for this option is 0.5, to create an even 50/50 split.

- **Intercept (‘Intercept Beta’).** Lets the user define the intercept for the regression model. The default is 0, and I wouldn’t recommend changing it unless you’re making inferences about the intercept of the regression model.
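To make the data-generating side concrete, here is a minimal sketch of one simulated dataset for a random-intercept model (the names and simplifications are my own, not the app’s internals): 10 ‘schools’ of 50 ‘children’, one normal Level-1 covariate with Beta = 0.5, intercept variance 0.5, and an intercept of 0.

```r
set.seed(42)
n_l2 <- 10   # Level 2 clusters ('schools')
n_l1 <- 50   # Level 1 units per cluster ('children')

school <- rep(1:n_l2, each = n_l1)
u0 <- rnorm(n_l2, sd = sqrt(0.5))  # random intercepts, variance 0.5
x <- rnorm(n_l2 * n_l1)            # Level-1 covariate

eta <- 0 + 0.5 * x + u0[school]    # linear predictor
y <- rbinom(n_l2 * n_l1, size = 1, prob = plogis(eta))

table(y)  # roughly balanced binary outcome
```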

Once the number of covariates has been selected, the app will offer the user **all possible 2-way interaction effects** irrespective of the level of the predictor and distribution characteristics. The user can select whichever 2-way interaction is of interest and assign an effect size/regression coefficient (i.e. ‘Beta’). The app will use this effect size to calculate power. Notice that the distribution of the interaction is fully defined by the distribution of its constituting main effects.

The number of datasets generated using the population parameters previously defined by the researcher. The default is 10, but I would personally recommend a minimum of 100. The larger the number of replications, the more accurate the results will be, but also the longer the simulation will take.

The simulated power is calculated as the proportion of statistically significant results out of the number of simulated datasets and will be printed here. Notice the time progress bar indicating that the simulation is still running. For a 2-covariate model with both a random effect for the intercept and the slope the simulation took almost 3 min to run. Expect longer waiting times if the model has lots of covariates.

This is what a sample of a full power analysis looks like. The estimated power can be found under the column ‘Power’. The column labelled ‘NA’ shows the proportion of models that did not converge. In this case, all models converged (there are 0s all throughout the NA column) but the power of the fixed and random effects is relatively low with the exception of the power for the variance of the random intercept. In this example one would need to either increase the effect size from 0.5 to something larger or increase the Level 1 and Level 2 sample sizes in order to obtain acceptable power levels of 80%. You can either download your power analysis results as a .csv file or copy-paste them by clicking on the appropriate button.

Finally, here is the link for the shiny web app:

– Give you the polychoric (or tetrachoric, in case of binary data) correlation matrix

– Do Parallel Analysis and a scree plot based on the polychoric (or tetrachoric) correlation matrix

– Calculate ordinal alpha as recommended in:

It currently accepts certain SPSS files (.sav extensions from older versions of SPSS, say around 2013 or earlier), Microsoft Excel files (.xls extensions) and comma-delimited files (.csv extensions). If your data is in none of those formats, please convert it before using the app (it’s super easy), or it will give you an error. Also notice that the app will use ALL of the variables in the uploaded file, so make sure you upload a file containing only the variables (test items, in most cases) you want to correlate/calculate alpha for. You’ll also need to provide a clean dataset for it to work: if you have missing values, you’ll need to remove them manually before submitting, and if there are outliers, those need to be dealt with before using the app.

Please notice that, in accordance with research, if you have 8 (or more) Likert response categories the app will give you an error saying you have enough categories to safely treat your variables as continuous, so you don’t really need this app. You can see why in Rhemtulla, Brosseau-Liard & Savalei (2012).

*In a traditional ANOVA setting (fixed effects, fully-balanced groups, etc.)… Does one test the normality assumption on the residuals or the dependent variable?*

Ed’s answer (as well as that of my talkstats.com friends): On the residuals. ALWAYS.

My answer: Although the distributional assumptions for these models are on the residuals, *for most designs found in education or the social sciences it doesn’t really matter whether you use the residuals or the dependent variable.*

Who is right, and who is wrong? The good thing about Mathematics (and Statistics as a branch of Mathematics) is that there’s only one answer. So either he is right or I am. Here are the two takes on the answer, each with a rationale.

**Ed is right.**

This is a simplified version of his answer that was also suggested on talkstats. Consider the following independent-groups t-test as shown in this snippet of R code. I’m assuming if you’re reading this you know that a t-test can be run as a linear regression.

```r
# Two groups of 1000, means 10 and 0, both SDs of 1
dv1 <- rnorm(1000, 10, 1)
dv2 <- rnorm(1000, 0, 1)
dv  <- c(dv1, dv2)
g   <- as.factor(rep(c(1, 0), each = 1000))
dat <- data.frame(dv, g)

# Run the t-test as a linear regression and extract the residuals
res <- as.data.frame(resid(lm(dv ~ g)))
colnames(res) <- c("residual")
```

If you plot the dependent variable, it looks like this:

And if you plot the residuals, they look like this:

Clearly, the dependent variable is not normally distributed. It is bimodal, better described as a 50/50 Gaussian mixture if you wish. However, the residuals are very much bell-shaped and… well, for lack of a better word, *normally distributed*. If we want to look at it more formally, we can conduct a Shapiro-Wilk test and see that it is not statistically significant.

```r
shapiro.test(res$residual)

#         Shapiro-Wilk normality test
#
# data:  res$residual
# W = 0.99901, p-value = 0.3432
```

So… yeah. Testing the dependent variable would’ve led someone to (erroneously) conclude that the assumption of normality was being violated and maybe this person would’ve ended up going down the rabbit hole of non-parametric regression methods… which are not bad *per se*, but I realize that for people with little training in statistics, these methods can be quite problematic to interpret. So Ed is right and I am wrong.

**I am right.**

When this example was put forward, I pointed out to Ed (and the other people involved in the discussion) that they should look at the assumption made about the population effect size. That’s a Cohen’s *d* of 10! Let’s see what happens when you run what’s considered a “large” effect size within the social sciences. Actually, let’s be very, very, VERY generous and jump straight from a Cohen’s *d* of 0.8 (large effect size) to a Cohen’s *d* of 1 (super large effect size?).
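A sketch of the modified simulation (my reconstruction under that assumption: group means of 1 and 0, both SDs of 1, so a Cohen's *d* of 1; exact p-values will depend on the seed):

```r
# Same t-test-as-regression setup, but with a Cohen's d of 1 instead of 10
set.seed(123)
dv  <- c(rnorm(1000, 1, 1), rnorm(1000, 0, 1))
g   <- as.factor(rep(c(1, 0), each = 1000))
res <- resid(lm(dv ~ g))

shapiro.test(res)   # residuals: no evidence against normality
shapiro.test(dv)    # dependent variable: also looks normal at d = 1
```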

The plot of the dependent variable now looks like this:

And the residual plot looks like this:

Uhm… both the dependent variable and the residuals are looking very normal to me. What if we test them using the Shapiro-Wilk test?

```r
shapiro.test(res$residual)

#         Shapiro-Wilk normality test
#
# data:  res$residual
# W = 0.99926, p-value = 0.6328

shapiro.test(dv)

#         Shapiro-Wilk normality test
#
# data:  dv
# W = 0.99944, p-value = 0.8515
```

Yup, both are pretty normal-looking. So, in this case, whether you test the dependent variable or the residuals you end up with the same answer.

Just for kicks and giggles, I noticed that you need a Cohen’s *d* of 2 before the Shapiro-Wilk test on the dependent variable yields a significant result, and even then the *W* statistics are quite similar between the previous case and this one. And we’re talking about sample sizes of 2000. Heck, even the plot of the dependent variable is looking pretty bell-shaped.

```r
shapiro.test(dv)

#         Shapiro-Wilk normality test
#
# data:  dv
# W = 0.99644, p-value = 8.008e-07
```

This is why, in my response, I included the addendum *for most designs found in education or social sciences it doesn’t really matter whether you use the residuals or the dependent variable*. A Cohen’s *d* of 2 is two-and-a-half times larger than what’s considered a large effect size in my field. If I were to see such a large effect size, I’d sooner think something funky was going on with the data than believe that such a large difference can actually be found. Ed comes from a natural science background, and I know that in the natural sciences large effect sizes are pretty commonplace (in my opinion it comes down to the problem of measurement that we face in the social sciences).

As you can see now, the degree of agreement between the normality tests of the dependent variable and the residuals is a function of the effect size. The larger the effect size, the larger the difference between the shapes of the distributions of the residuals vs. the dependent variable (within this context, of course; this is not true in general).

Strictly speaking, Ed and the talkstats team are right in the sense that you can never go wrong with testing the residuals, which is what I pointed out at the beginning as well. My applied research experience, however, has made me more practical, and I realize that in most cases it doesn’t really matter. And at a certain sample size the normality assumption is just so irrelevant that even testing it may be unnecessary. But anyway, some food for thought right there.

For some reason that I can’t quite remember, I thought it would be a good idea to investigate the properties of Lawley’s test for the equality of correlations in a correlation matrix. So the null hypothesis of this test kind of looks like this:
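Written out (my rendering of it), the null hypothesis is that every off-diagonal element of the population correlation matrix is the same:

$$H_0:\ \rho_{ij} = \rho \quad \text{for all } i \neq j$$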

I have absolutely no idea why I thought this was a good idea. I’m thinking it may have been back in the days when I was not as proficient in Structural Equation Modelling (SEM), so I’d fall back on my mathematical training to accomplish things that can be trivially set up as SEMs. Going by my poorly-documented code and meager notes, it seems I thought I could tweak the test to evaluate whether the assumption of Parallel Tests or Tau-Equivalent Tests holds. You know, to make sure people would know whether their Cronbach’s alpha was a true estimate of reliability or just a lower bound. Little did I know at that point that people just use alpha whether the assumptions behind it hold or not. The truth is nobody really cares.

The one thing I remember is that I couldn’t find any R package that would run the test, so I ended up coding it myself. I’m not sure if this will be of any use to anyone, but I’m hoping it may save some time for someone out there who needs it, for whatever reason.

So the R function you’d have to declare is:

```r
lawley <- function(data) {
  R <- cor(data)   # sample correlation matrix
  p <- nrow(R)     # number of variables
  n <- nrow(data)  # sample size

  # Vector of average values of the non-diagonal elements of each row
  mcp <- numeric(p)
  for (i in 1:p) {
    mcp[i] <- (sum(R[, i]) - 1) / (p - 1)
  }

  # Overall average of the off-diagonal correlations
  lR <- R
  lR[upper.tri(R, diag = FALSE)] <- 0
  mc <- (2 / (p * (p - 1))) * sum(R - lR)

  # Sum of squared deviations of the off-diagonal elements from that average
  z3 <- upper.tri(R, diag = TRUE)
  TR <- R
  TR[z3] <- 0
  A <- (TR - mc)^2
  A[z3] <- 0
  A <- sum(A)

  B <- sum((mcp - mc)^2)
  C <- ((p - 1)^2 * (1 - (1 - mc)^2)) / (p - (p - 2) * (1 - mc)^2)

  # Lawley's chi-square statistic, degrees of freedom and p-value
  X2 <- ((n - 1) / (1 - mc)^2) * (A - C * B)
  v  <- ((p + 1) * (p - 2)) / 2
  P  <- pchisq(X2, v, lower.tail = FALSE)

  result <- rbind(c(X2, v, P))
  dimnames(result) <- list(c("X2"), c("statistic", "df", "p.value"))
  print(result)
}
```

And the way you use it is very simple. Let’s try it with a mock dataset we’ll generate with the R package `lavaan`. Here we’re generating data from a one-factor model with equal loadings of 0.3 and equal error variances:

```r
library(lavaan)

set.seed(123)
pop <- 'f1 =~ 0.3*x1 + 0.3*x2 + 0.3*x3 + 0.3*x4 + 0.3*x5'
dat <- simulateData(model = pop, sample.nobs = 500)
lawley(dat)

#    statistic df   p.value
# X2  7.840194  9 0.5503277
```

So… yeah. The non-significant p-value (as per glorious alpha of .05) gives us no evidence against the null hypothesis, consistent with a population correlation matrix that has equal elements all around.

Let’s change one loading and see what happens:

```r
library(lavaan)

set.seed(123)
pop <- 'f1 =~ 0.4*x1 + 0.3*x2 + 0.3*x3 + 0.3*x4 + 0.3*x5'
dat <- simulateData(model = pop, sample.nobs = 500)
lawley(dat)

#    statistic df    p.value
# X2  18.78748  9 0.02706175
```

So the null hypothesis is now rejected: because one of the loadings is different, the correlations involving x1 are no longer equal to the other elements of the correlation matrix.

I literally have no clue why Lawley thought having such a test was a good idea. But then again I was investigating it a few years ago so maybe *I* thought the test was a good idea in the first place.

Anyhoo, I hope this helps someone if they need it.

Let me show you why.

Say we have the very simple scenario of calculating power for the easy-cheesy *t-test* of the Pearson correlation coefficient. We are going to be extra indulgent with ourselves and claim the population effect size is ρ = 0.5 (so a LARGE effect size à la Cohen). If you plug in the usual specifications in G*Power (Type I error rate of .05, desired power of 0.8, population effect size of 0.5 against the null of 0) this is what we get:

So your sample size should be 26. Just for kicks and giggles, I simulated the power curve for this exact scenario and marked with a line where the 80% power would be located.
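Here's a hedged sketch of that brute-force check (my reconstruction, not the exact simulation behind the curve; the 2000 replications and the seed are my choices): simulate bivariate normal data with ρ = 0.5 at n = 26, run `cor.test`, and count significant results.

```r
# Simulation-based power for the t-test of a Pearson correlation,
# bivariate normal data, population rho = 0.5, n = 26
set.seed(123)
reps <- 2000; n <- 26; rho <- 0.5
pvals <- replicate(reps, {
  x <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)   # bivariate normal pair
  cor.test(x, y)$p.value
})
mean(pvals < 0.05)   # lands near the 0.80 G*Power promises
```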

Same answer as with G*Power, somewhere a little over *n=25*. Pretty straightforward, right? Well… sure… if you’re comfortable with **the assumption that your data is bivariate normal.** Both in the R simulation I made and in G*Power, the software assumes that your data looks like this:

For even more kicking and giggling, let’s assume your data is NOT normal (which, as we know, is the more common case). In this particular instance, both variables are -distributed with 1 degree of freedom (quite skewed). Each variable looks like this:

And their joint density (e.g. if you do a scatterplot), looks like that:

But here’s the catch… because of how I simulated them (through a Gaussian copula if you’re wondering), they both have the *same* population effect size of 0.5. What does the power curve look like in this case? It looks like this:
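For the curious, here is a sketch of the copula trick (my reconstruction, not the post's actual code; the latent correlation of 0.75 is an arbitrary illustration): draw bivariate normal data, push each margin through `pnorm` to get uniforms, then through `qchisq` to get χ²(1) margins. Notice the resulting Pearson correlation comes out *below* the latent one, which is why the latent value has to be tuned upward to hit a target like 0.5.

```r
# Gaussian copula with chi-square(1) margins
set.seed(123)
n <- 1e5
rho_z <- 0.75                                  # latent normal correlation (assumed)
z1 <- rnorm(n)
z2 <- rho_z * z1 + sqrt(1 - rho_z^2) * rnorm(n)  # bivariate normal pair
x1 <- qchisq(pnorm(z1), df = 1)                # normal -> uniform -> chi-square(1)
x2 <- qchisq(pnorm(z2), df = 1)
cor(x1, x2)                                    # attenuated below rho_z
```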

So, for the same large population effect size, you need a little over **TWICE** the sample size to obtain the same 80% power.

You see where I’m going with this? Where’s the my-data-is-not-normal option in G*Power? Or the my-data-has-missing-values one? Or my-data-has-measurement-error? Or my-data-has-all-of-those-at-once? Sure, I realize this is a bit of an extreme case: the sample size is not terribly large, the non-normality is severe, and by the time *n=100* the malicious influence of the non-normality has been “washed away”, so to speak. The power curves look more and more similar as the sample size grows. But it is still a reminder that every time I see people report their power analyses through G*Power my mind immediately goes to… “is this really power, or a lower/upper bound on power?” And, moreover… if you go ahead, do your analyses and find your magic *p-value* under .05, you’re probably going to feel even *more* confident that your results are the real deal, right? I mean, you did your due diligence, you’re aware of the issues and you tried to address them the best way you could. And that’s exactly what kills me. Sometimes your best is just not good enough.

Solutions? Well… I dunno. Unless someone makes computer simulations mandatory in research methods classes, the only other option I have is usually to close my eyes and hope for the best.
