Author Archives: oscarolvera

Multivariate Poisson & Multivariate Exponential Distributions (not everything needs a copula ;)

While I am preparing a more “in-depth” treatment of this Twitter thread that sparked some interest (thank my lucky stars!), I ran into a couple of curious distributions that I think highlight an important lesson: not EVERYTHING needs a copula 😉 Sure, copulas are flexible and powerful modelling approaches. But at the end of the day there are other methods to create multivariate, non-normal distributions. And many of them have very elegant theoretical properties that allow us to extend our intuitions about these distributions from the univariate to the multivariate setting.

So here I present two distributions which can be generalized from their univariate to a multivariate definition without invoking a copula.

Multivariate Poisson Distribution

The Poisson distribution is closed under convolutions. For the purposes of this post, that means that if X_{1} and X_{2} are independent and Poisson-distributed (with parameters \lambda_{1}, \lambda_{2} respectively) then Y=X_{1}+X_{2} is also Poisson-distributed (with parameter… yup, you got it! \lambda_{Y}=\lambda_{1}+\lambda_{2} ).

So let’s start easy, with the bivariate case. To construct a bivariate Poisson random vector (X,Y)' we can use the following stochastic representation:

X = X_{1} + X_{1,2}

Y = X_{2} + X_{1,2}

Where X_{1}, X_{2}, X_{1,2} are independent Poisson random variables with parameters \lambda_{1}, \lambda_{2}, \lambda_{1,2} respectively. Since the Poisson distribution is closed under convolutions, X and Y are Poisson-distributed with means (and variances) \lambda_{1}+\lambda_{1,2} and \lambda_{2}+\lambda_{1,2} respectively, and covariance Cov(X, Y) = \lambda_{1,2} . Notice that this construction implies the restriction Cov(X, Y) \geq 0 .
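As a quick sanity check, the construction above is only a few lines of R. (The \lambda values below are illustrative, not taken from any real example.)

```r
# Trivariate-reduction construction of a bivariate Poisson vector
set.seed(123)
n <- 1e5
lambda1 <- 2; lambda2 <- 3; lambda12 <- 1
X1  <- rpois(n, lambda1)
X2  <- rpois(n, lambda2)
X12 <- rpois(n, lambda12)
X <- X1 + X12   # marginally Poisson(lambda1 + lambda12)
Y <- X2 + X12   # marginally Poisson(lambda2 + lambda12)
mean(X); var(X)   # both close to lambda1 + lambda12 = 3
cov(X, Y)         # close to lambda12 = 1
```

The sample mean and variance of X should both sit near 3, and the sample covariance near 1, matching the moments stated above.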

Using the above property we can derive the joint probability function of (X,Y)' . Consider P(X=r, Y=s) for fixed values r, s . Since X_{1,2} can take any value between 0 and min(r,s) , and X_{1}, X_{2}, X_{1,2} are mutually independent, we can condition on X_{1,2}=k and write the joint probability function as:

P(X=r, Y=s)=\sum_{k=0}^{min(r,s)}f_1(r-k)f_2(s-k)f_{1,2}(k)

where f_1, f_2, f_{1,2} are the probability functions of X_1, X_2, X_{1,2} respectively. By making the proper substitutions in the f's and some collecting of terms we have:

P(X=r, Y=s)=e^{-(\lambda_1+\lambda_2+\lambda_{1,2})}\frac{\lambda^{r}_{1}}{r!}\frac{\lambda^{s}_{2}}{s!}\sum_{k=0}^{min(r,s)}\binom{r}{k}\binom{s}{k}k!\left ( \frac{\lambda_{1,2}}{\lambda_1\lambda_2} \right ) ^{k}
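The pmf translates directly into R, and a quick check is that marginalizing over s recovers the Poisson(\lambda_1+\lambda_{1,2}) marginal of X. (Parameter values are again illustrative; the function name is mine.)

```r
# Joint pmf of the bivariate Poisson, straight from the formula above
dbvpois <- function(r, s, l1, l2, l12) {
  k <- 0:min(r, s)
  exp(-(l1 + l2 + l12)) * (l1^r / factorial(r)) * (l2^s / factorial(s)) *
    sum(choose(r, k) * choose(s, k) * factorial(k) * (l12 / (l1 * l2))^k)
}
# Summing P(X = 2, Y = s) over s should recover P(X = 2) for a Poisson(l1 + l12)
p_marginal <- sum(sapply(0:60, function(s) dbvpois(2, s, 2, 3, 1)))
p_marginal
dpois(2, 2 + 1)   # same value
```

The agreement between the two printed numbers is a direct numerical confirmation that the construction and the pmf describe the same distribution.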

From here we can expand the process to, say, a trivariate Poisson random vector by expressing the 3-D vector (X, Y, Z)' as:

X = X_{1} + X_{1,2} + X_{1,3} +X_{1,2,3}

Y = X_{2} + X_{1,2} + X_{2,3} +X_{1,2,3}

Z = X_{3} + X_{1,3} + X_{2,3} +X_{1,2,3}

Where all the X’s are themselves independent and Poisson-distributed, and the terms with double (and triple) subscripts control the level of covariance among the Poisson marginal distributions. If you repeat this iteratively, adding more and more terms to the summation, you can increase the dimension of the multivariate Poisson distribution.
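With this bookkeeping, Cov(X,Y) becomes \lambda_{1,2}+\lambda_{1,2,3} (the rates of all components shared by X and Y), which is easy to check by simulation. The rates below are made up for illustration:

```r
# Trivariate Poisson via shared components (illustrative rates)
set.seed(101)
n <- 1e5
x1  <- rpois(n, 1);   x2  <- rpois(n, 1);   x3  <- rpois(n, 1)
x12 <- rpois(n, 0.5); x13 <- rpois(n, 0.3); x23 <- rpois(n, 0.2)
x123 <- rpois(n, 0.1)
X <- x1 + x12 + x13 + x123
Y <- x2 + x12 + x23 + x123
Z <- x3 + x13 + x23 + x123
cov(X, Y)   # close to 0.5 + 0.1 = 0.6
cov(Y, Z)   # close to 0.2 + 0.1 = 0.3
```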

Multivariate Exponential Distribution

The univariate exponential distribution is also (sort of) closed, but under a different operation: taking weighted minima. For this post, that means that if Y_1, Y_2, Y_3, ..., Y_n are independent exponential random variables, then min(w_1Y_1, w_2Y_2, w_3Y_3, ..., w_nY_n) is also exponentially-distributed for any weights w_1, w_2, w_3, ..., w_n >0.

We can start very much like the previous case, by defining what the bivariate distribution looks like. Here we start with:

X = min(Y_{1}, Y_{1,2})

Y = min(Y_{2}, Y_{1,2})

where Y_1, Y_2, Y_{1,2} are independent, exponentially-distributed random variables with parameters \lambda_{1}, \lambda_{2}, \lambda_{1,2} respectively (I SWEAR to God the standard mathematical notation for the parameter of both the exponential and the Poisson distribution is \lambda ). Because the minimum of independent exponentials is itself exponential, with the rates adding up, X and Y are also exponentially-distributed with parameters \lambda_{1}+\lambda_{1,2} and \lambda_{2}+\lambda_{1,2} respectively. Since Y_1, Y_2, Y_{1,2} are independent, we have:

P(X>x, Y>y)=P(Y_1>x, Y_{1,2}>x, Y_2>y, Y_{1,2}>y)=

e^{-\lambda_1x}e^{-\lambda_2y}e^{-\lambda_{1,2}max(x,y)}

And the joint cumulative distribution function of the bivariate vector (X,Y)' follows from the survival function by inclusion-exclusion:

F(x, y)=1-e^{-(\lambda_{1}+\lambda_{1,2})x}-e^{-(\lambda_{2}+\lambda_{1,2})y}+e^{-\lambda_{1}x-\lambda_{2}y-\lambda_{1,2}max(x,y)}
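This shared-component-minimum construction is also two lines of R, and the empirical joint survival probability can be compared against the closed form derived above. (The rates and the evaluation point (x, y) = (0.5, 0.3) are illustrative.)

```r
# Bivariate exponential via shared-component minima
set.seed(456)
n <- 1e5
l1 <- 1; l2 <- 2; l12 <- 0.5
Y1  <- rexp(n, l1)
Y2  <- rexp(n, l2)
Y12 <- rexp(n, l12)
X <- pmin(Y1, Y12)   # Exponential(l1 + l12)
Y <- pmin(Y2, Y12)   # Exponential(l2 + l12)
mean(X > 0.5 & Y > 0.3)                          # empirical P(X > 0.5, Y > 0.3)
exp(-l1 * 0.5 - l2 * 0.3 - l12 * max(0.5, 0.3))  # closed form, approx 0.259
```

The two printed numbers should agree to a couple of decimal places, and mean(X) should sit near 1/(l1 + l12).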

Moral of the story

If you know me, you’ll know that I tend to be overly critical of the things I like the most, a habit that doesn’t always make me a lot of friends, heh. Because I like copula modelling and I like the idea of non-normal, multivariate structures, I also like to see and understand the cases where defining multivariate structures without a copula may give us insights. For the two cases above, I am exploiting the fact that sums (or minima) of these types of random variables result in the same type of random variable (i.e., closure under convolution) which, for better or worse, is a very useful property that not many univariate probability distributions have. Nevertheless, when they *do* have it, it is perhaps wise to use it because, at the end of the day, using copulas to either simulate or model multivariate data does not imply the copula distribution *becomes* the “multivariate” version of that distribution. A Gaussian copula with gamma-distributed marginals is not a “multivariate gamma distribution”. It is… well… a Gaussian copula with gamma-distributed marginals. If you change the copula function for something else (say an Archimedean copula) the multivariate properties change as well.

Which brings us to a very sobering realization: with the exception of some very select types of multivariate distributions (usually those closed under convolution) we don’t always have well-defined multivariate extensions of univariate distributions. For example, there are more than 10 different ways to define distributions that would satisfy what one would call a “multivariate t distribution”. Whichever characterization one chooses is usually contingent on the intended use for it. I guess this helps highlight one of the oldest pieces of advice I ever received: you know you have become a good methodologist when you realize the only correct answer to every data analysis question is simply “it depends”.

Power analysis for multiple regression with non-normal data

This app will perform computer simulations to estimate the power of the t-tests within a multiple regression context under the assumption that the predictors and the criterion variable are continuous and either normally or non-normally distributed. If you would like to see why I think this is important to keep in mind, please read this blog post.

When you first click on the app it looks like this:

mult1

What *you*, the user, needs to provide it with is the following:

The number of predictors. It can handle anything from 3 to 6 predictors. With more than that, the overall aesthetics of the app simply become too crowded. The default is 3 predictors.

The regression coefficients (i.e., the standardized effect sizes) that hold in the population. The default is 0.3. The app names them “x1, x2, x3, …, x6”.

The skewness and excess kurtosis of the data for each predictor AND for the dependent variable (the app calls it “y”). Please keep on reading to see how you should choose those values. The defaults at this point are a skewness of 2 and an excess kurtosis of 7.

The pairwise correlations among the predictors. I think this is quite important because the correlation among the predictors plays a role in calculating the standard error of the regression coefficients. So you can either be VERY optimistic and place those at 0 (predictors are perfectly orthogonal with one another) OR you can be very pessimistic and give them a high correlation (multicollinearity). The default inter-predictor correlation is 0.5.

The sample size. The default is 200.

The number of replications for the simulation. The default is 100.

Now, what’s the deal with the skewness and excess kurtosis? A lot of people do not know this, but you cannot go around choosing values of skewness and excess kurtosis all willy-nilly. There is a quadratic relationship between the possible values of skewness and excess kurtosis that specifies they MUST be chosen according to the inequality kurtosis > skewness^2-2 . If you don’t do this, the app will spit out an error. Now, I am **not** a super fan of the algorithm needed to generate data with those population-specified values of skewness and excess kurtosis. For many AND even more reasons. HOWEVER, I needed to choose something that was both sufficiently straightforward to implement and not very computationally intensive, so the 3rd-order polynomial approach will have to do.

Now, the exact boundaries of what this method can reach are actually smaller than the theoretical parabola. For practical purposes, though, as long as you choose values of kurtosis which are sufficiently far from the square of the skewness, you should be fine. So a combo like skewness=3, kurtosis=7 would give it trouble, but something like skewness=3, kurtosis=15 would be perfectly fine.
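A one-line helper makes the feasibility check explicit before you hand values to the app (the function name is mine, not the app’s):

```r
# Feasibility of a (skewness, excess kurtosis) pair according to the
# inequality quoted above: kurtosis > skewness^2 - 2
feasible <- function(skew, exkurt) exkurt > skew^2 - 2
feasible(3, 7)    # FALSE: 7 is not above 3^2 - 2 = 7
feasible(3, 15)   # TRUE: comfortably inside the parabola
```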

A hypothetical run would look like this:

mult2

So the output under Results is the empirical, simulated power for each regression coefficient at the sample size selected. In this case, they gravitate around 60%.
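Under the hood, simulated power is just the proportion of significant t-tests across replications. Here is a stripped-down sketch with a single normal predictor (my simplification for illustration, not the app’s actual code, which handles several correlated, non-normal predictors):

```r
# Minimal simulated-power loop for one regression coefficient
set.seed(11)
power_sim <- function(n, beta, reps = 500) {
  mean(replicate(reps, {
    x <- rnorm(n)
    y <- beta * x + rnorm(n)                        # population slope = beta
    summary(lm(y ~ x))$coefficients["x", 4] < 0.05  # p-value of the t-test
  }))
}
power_sim(200, 0.2)
```

The app does the multivariate, non-normal analogue of this loop and reports one such proportion per coefficient.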

Oh! And if, for whatever reason, you would like to have all your predictors be normal, you can set the values of skewness and excess kurtosis to 0. In fact, in that situation you would end up working with a multivariate normal distribution.

YOU CAN CLICK HERE TO ACCESS THE APP

Power Analysis for the Pearson correlation with non-normal data

This app will perform computer simulations to estimate the power of the t-test for the Pearson bivariate correlation under the assumption that the data are continuous and either bivariate normally distributed or non-normally distributed. To see why this is important, please check out this blog post.

It is very straightforward to use. When you click on the app, it will look like this:

corr2

What *you*, the user, needs to provide it with is the following:

The population correlation (i.e., the effect size) for which you would like to obtain power. The default is 0.3.

The type of distributions you would like to correlate together. Right now it can handle the chi-square distribution (where skewness is controlled through the degrees of freedom), the uniform distribution (to have a symmetric distribution with negative kurtosis) and the binomial distribution where one can control the number of response categories (size) and the probability parameter. This will soon be replaced by a multinomial distribution so that the probability of every marginal response option can be specified.

The sample size. The default is 20.

The number of replications for the simulation. The default is 100.

Now, what is the app actually doing? It runs R underneath and it will give you the estimated power of the t-test under both conditions. The first one is calculated under the assumption that your data are bivariate normally distributed. This is just a direct use of the pwr.r.test function from the pwr R package, and it should give you answers very close to (or exactly the same as) G*Power. I chose to use it for the sake of comparison.
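The simulation half can be sketched in base R for the bivariate normal case (again, this is my sketch, not the app’s code):

```r
# Simulated power for the Pearson correlation t-test (bivariate normal case)
set.seed(22)
cor_power_sim <- function(n, rho, reps = 500) {
  mean(replicate(reps, {
    z <- rnorm(n)
    x <- z
    y <- rho * z + sqrt(1 - rho^2) * rnorm(n)   # population correlation = rho
    cor.test(x, y)$p.value < 0.05
  }))
}
cor_power_sim(20, 0.3)   # low power at the default n = 20
```

Swapping the normal draws for, say, chi-square or binomial draws is exactly where the simulated and the closed-form answers start to diverge.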

What comes out is something that looks like this:

corr

 

So, on top, you’re going to get the power as if you were using G*Power (or the pwr R package) which is exact and needs no simulation because we have closed-form expressions for it. On the bottom you are going to get the approximated power using simulations. Remember, when working with non-normal data you can’t always expect power to be lower, as in this case. Sometimes it may be higher and sometimes it won’t change much. Yes, at some point (if your sample size is large enough) the distribution of your data will matter very little. But, in the meantime, at least you can use this one to guide you!

Finally, here’s the link for the shiny web app:

YOU CAN CLICK HERE TO ACCESS THE APP

Sylvester’s criterion to diagnose Heywood cases

So… here’s a trick I use sometimes that I’ve never seen anywhere else and it may be helpful to some people.

If you work with latent variable models, Factor Analysis, Structural Equation Modelling, Item Response Theory, etc. there’s a good chance that you have either encountered or have seen some version of a warning about a covariance matrix being “non positive definite”. This is an important warning because the software is telling you that your covariance matrix is not a valid covariance matrix and, therefore, your analysis is suspect. Usually, within the world of latent variables, we call these types of warnings Heywood cases.

Now, most Heywood cases are very easy to spot because they pertain to one of two broad classes: negative variances or correlations greater than 1. When you inspect your model matrices and see either of those two cases, you know exactly which variable is giving you trouble. Thing is (as I found out a few years ago), there are other types of Heywood cases that are a lot more difficult to diagnose. Consider the following matrix that I once got while helping a student with his analysis:


         space   lstnng  actvts  prntst  persnl  intrct  prgrmm
space    1.000
lstnng   0.599   1.000
actvts   0.706   0.646   1.000
prntst   0.702   0.459   0.653   1.000
persnl   0.591   0.582   0.844   0.776   1.000
intrct   0.627   0.964   0.501   0.325   0.639   1.000
prgrmm   0.493   0.602   0.981   0.687   0.944   0.642   1.000

This is the model-implied correlation matrix I obtained through the analysis, which gave a Heywood case warning. The student was a bit puzzled because, although we had reviewed these types of situations in class, we had only mentioned negative variances and correlations greater than one. Here he had neither a negative variance nor a correlation greater than one… but he still got a warning for non-positive definiteness.

My first reaction was, obviously, to check the eigenvalues of the matrix and, lo and behold, there it was:
[1] 5.01377877 1.00744933 0.62602056 0.30393170 0.16671742 0.01317704 -0.13107483

So… yeah. This was, indeed, an invalid correlation/covariance matrix and we needed to further diagnose where the problem was coming from… but how? Enter our good friend, linear algebra.

If you have ever taken a class in linear algebra beyond what’s required in a traditional methodology/statistics course sequence for social sciences, you may have encountered something called the minor of a matrix. Minors are important because they’re used to calculate the determinant of a matrix. They’re also important for cases like this one because they break down the structure of a matrix into simpler components that can be analyzed. Way in the back of my mind I remembered from my undergraduate years that positive-definite matrices had something special about their minors. So when I came home I went through my old, OLD notes and found this beautiful theorem known as Sylvester’s criterion:

A Hermitian matrix is positive-definite if and only if all of its leading principal minors are positive.

All covariance matrices are Hermitian (the subject of another blog post), so we’re only left to wonder what a leading principal minor is. Well, imagine starting at the [1,1] coordinate of a matrix (so really the upper-left entry) and expanding downwards diagonally, one row and column at a time; the determinants of those nested submatrices are the leading principal minors. A picture makes a lot more sense here:

SylvesterCriterion

So… yeah. The red square (so the q11 entry) gives the first leading principal minor. The blue square (a 2 x 2 matrix) gives the 2nd, the yellow square (a 3 x 3 matrix) gives the 3rd, and on it goes until you reach the full n x n matrix. For the matrix Q to be positive-definite, all n leading principal minors need to be positive. So if you want to “diagnose” your matrix for positive definiteness, all you need to do is start from the upper-left corner and check the determinants consecutively until you find one that is negative. Let’s try this on the previous example. Notice that I’m calling the matrix ‘Q’:
> det(Q[1:1,1:1])
[1] 1

> det(Q[1:2,1:2])
[1] 0.641199

> det(Q[1:3,1:3])
[1] 0.2933427

> det(Q[1:4,1:4])
[1] 0.01973229

> det(Q[1:5,1:5])
[1] 0.003930676

> det(Q[1:6,1:6])
[1] -0.003353769

The first [1,1] entry is a given (it’s a correlation matrix, so it’s 1 and it’s positive). Then we move down: the 2×2 matrix, the 3×3 matrix… all the way to the 5×5 matrix. By then the determinant of Q is very small, so I suspected that whatever the issue was, it had to do with the relationships among the variables “persnl”, “intrct” or “prgrmm”. The final determinant pointed towards the culprit: whatever problem this matrix exhibited, it had to do with the relationship between “intrct” and “prgrmm”.
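This scan is easy to automate. Here is a small helper that returns all the leading principal minors in order, applied to a toy 3×3 matrix (made up for illustration) where the 3×3 block is what breaks positive definiteness:

```r
# Leading principal minors of a square matrix, in order
leading_minors <- function(Q) {
  sapply(seq_len(nrow(Q)), function(k) det(Q[1:k, 1:k, drop = FALSE]))
}
Q <- matrix(c(1.0, 0.9, 0.3,
              0.9, 1.0, 0.9,
              0.3, 0.9, 1.0), nrow = 3, byrow = TRUE)
leading_minors(Q)   # the first negative entry flags the offending block
```

By Sylvester's criterion, the position of the first negative minor tells you which added row/column (and hence which variable) introduced the problem.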

Once I pointed out to the student that I suspected the problem was coming from one of these two variables, a careful examination revealed the cause: the “intrct” item was a reverse-coded item but, for some reason, several of the participants’ responses had not been reverse-coded. So a good chunk of the responses to this item pointed in one direction and a smaller (albeit still large) number pointed in the other. The moment this item was fully reverse-coded, the non-positive-definiteness issue disappeared.

I guess there are two lessons to this story: (1) Rely on the natural structure of things to diagnose problems and (2) Learn lots of linear algebra 🙂

The Spearman correlation doesn’t need to tell you anything about the Pearson correlation

I… am a weird guy (but then again, that isn’t news, hehe). And weird people sometimes have weird pet peeves. Melted cheese on Italian food and hamburgers? Great! I’ll take two. Melted cheese anywhere else? Blasphemy! That obviously extends to Statistics. And one of my weird pet peeves (which was actually strong enough to prompt me to write it as a chapter of my dissertation and a subsequent published article) is when people conflate the Pearson and the Spearman correlation.

The theory of what I am going to talk about is developed in said article, but when I was cleaning more of my computer files I found an interesting example that didn’t make it there. Here’s the gist of it:

I really don’t get why people say that the Spearman correlation is the ‘robust’ version of, or alternative to, the Pearson correlation. Heck, if you simply google the words Spearman correlation, the second top hit reads:

Spearman’s rank-order correlation is the nonparametric version of the Pearson product-moment correlation

When I read that, my mathematical mind immediately goes to “if this is the non-parametric version of the Pearson correlation, that means it also estimates the same population parameter”. And honest to G-d (don’t quote me on that one, though, but I just *know* it’s true) I feel like the VAST MAJORITY of people think exactly that about the Spearman correlation. And I wouldn’t blame them either… you can’t open an intro textbook for social scientists that doesn’t have some dubious version of the previous statement. “Well” – the reader might think – “if that is not true, then why aren’t more people saying it?” The answer is that, for better or worse, this is one of those questions that are very simple to ask but mathematically complicated to answer. But here’s the gist of it (again, for those who like theory like I do, read the article).

The Spearman rank correlation is defined, in the population, like this:

\rho_{S}=12\int_{0}^{1}\int_{0}^{1}C(u_{x},u_{y})\,du_{x}\,du_{y}-3

where the u_i are uniformly-distributed random variables and C(\cdot) is the copula function that relates them (more on copulas here). By defining the Spearman rank correlation in terms of the lower-dimensional marginals it co-relates (i.e. the u’s) and the copula function, it becomes apparent that the overlap with the Pearson correlation depends entirely on what C(u_x,u_y) is. Actually, it is not hard to show that if C(\cdot) is a Gaussian copula and the marginals are normal, then the following identity can be derived:

\rho_S=\frac{6}{\pi}\sin^{-1}\left ( \frac{\rho}{2} \right )

which relates the Pearson correlation \rho to the Spearman correlation. We’ve known this since the times of Pearson, because he came up with it (albeit not explicitly) and called it the “grade correlation”. But from this follows the obvious: an identity such as the one described above need not exist. There could be copula functions for which the Spearman and Pearson correlations have a crazy, wacky relationship… and that is what I am going to show you today.
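Before the wacky example, the Gaussian identity itself is easy to verify by simulation (the value \rho = 0.5 below is arbitrary):

```r
# Checking rho_S = (6/pi) * asin(rho/2) under a Gaussian copula
# with normal marginals
set.seed(789)
n <- 1e5
rho <- 0.5
z <- rnorm(n)
x <- z
y <- rho * z + sqrt(1 - rho^2) * rnorm(n)   # bivariate normal, cor(x, y) = rho
cor(x, y, method = "spearman")
(6 / pi) * asin(rho / 2)                    # approx 0.483
```

The empirical Spearman correlation lands right on the grade-correlation formula, as it should whenever the copula is Gaussian.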

I don’t remember why I didn’t include this example in the article, but it is a very efficient one. It shows a case where the Spearman correlation is close to 1 but the Pearson correlation is close to 0 (obviously within sampling error).

Define Z \sim N(0, 0.1) , X=Z^{201} and Y= e^{Z}. If you use R to simulate a very large sample (say 1 million draws) then you will find Spearman and Pearson correlations like these:


N <- 1000000
Z <- rnorm(N, mean=0, sd=0.1)
X <- Z^201
Y <- exp(Z)

> cor(X,Y, method="pearson")
[1] 0.004009381

> cor(X,Y, method="spearman")
[1] 0.9963492

The trick of this example is that both X and Y are strictly increasing functions of Z, so their ranks coincide and the Spearman correlation sits at (essentially) 1. The Pearson correlation, on the other hand, collapses: raising Z to the 201st power squashes almost every value towards 0 while the change in Y stays smooth around 1, so there is next to no *linear* relationship left between the two.

In any case, the point is that, even without being too formal, it’s not overly difficult to see that the Spearman correlation is its own statistic and estimates its own population parameter, which may or may not have anything to do with the Pearson correlation, depending on the copula function describing the bivariate distribution.

 

 

Power Analysis for Multilevel Logistic Regression

::UPDATE::

A published article introducing this app is now online in BMC Medical Research Methodology. If you plan on using this app, it would be a good idea to cite it 😉

The relationship between statistical power and predictor distribution in multilevel logistic regression: a simulation-based approach

 

WARNING (1): This app can take a little while to run. Do not close your web browser unless it gives you an error. If the app appears ‘stuck’ but you haven’t got an error, it means the simulation is still running in the background.

WARNING (2): If you keep getting a ‘disconnected from server’ error, close your browser and open a new window. If the problem persists, it means too many people have tried to access the app during the day and the server has shut down. This app is hosted on a free server and it can only accommodate a certain number of people every day.

This app will perform computer simulations to estimate power for multilevel logistic regression models allowing for continuous or categorical covariates/predictors and their interactions. The continuous predictors come in two types: normally distributed or skewed (i.e. χ² with 1 degree of freedom). It currently only supports binary categorical covariates/predictors (i.e. Bernoulli-distributed), but with the option to manipulate the probability parameter p to simulate imbalance between the groups.

The app will give you the power for each individual covariate/predictor AND for the variance component of the intercept (if you choose to fit a random-intercept model) or of the slope (if you choose to fit a model with both a random intercept and a random slope). It uses the Wald test statistic for the fixed-effect predictors and a 1-degree-of-freedom likelihood-ratio test for the random effects (yes, I know this is conservative, but it’s the fastest one to implement).

When you open the app, here’s how it looks:

screen1

What **you**, as the user, need to provide is the following:

SampleSizes

The Level 1 and Level 2 sample sizes. If I were to use the ubiquitous example of “children in schools”, the Level 1 sample would be the children (individuals within a cluster) and the Level 2 sample would be the schools (number of clusters). For demonstration purposes here I’m asking for groups of 50 ‘children’ in 10 ‘schools’ for a total sample size of 50 × 10 = 500 children.

RandomEffects

The variance for the random effects. You can either choose to fit an intercept-only model (so no slope variance) or a model with both a random intercept AND a random slope. You cannot fit a slope-only model here, and you cannot set the variances to 0 to fit a single-level logistic regression (there is other software for power analysis in single-level logistic regression). At least the variance of the intercept needs to be specified. Notice that the app defaults to an intercept-only model, so under ‘Select Covariate’ it will say ‘None’. That changes when you click on the drop-down menu, where you can choose which predictor gets the random slope. Notice that you can only choose one predictor to have a random slope. I will work on the general case in the future.

Covariates1

The number of covariates (or predictors) which I believe is pretty self-explanatory. Just notice that the more covariates you add, the longer it will take for the simulation to run. The default in the app is 2 covariates.

Covariates2

This would be the core of the simulation engine because the user needs to specify:

  • Regression coefficients (‘Beta’). This space lets the user specify the effect size for the regression coefficients under investigation. The default is 0.5 but that can be changed to any number. In the absence of any outside guidance, Cohen’s small-medium-large effect sizes are recommended. Remember that the regression coefficient for a binary predictor is conceptualized as a standardized mean difference, so it should be in Cohen’s d metric.
  • Level of the predictor (‘Level’). The app only supports 2-level models, so the options are ‘1’ or ‘2’. This section indicates whether a predictor belongs to the Level 1 sample (e.g. the ‘children’) or the Level 2 sample (e.g. the ‘schools’). Notice that whichever predictor gets assigned a random slope MUST also be selected as Level 1; otherwise the power analysis results will not make sense. The app currently only supports one Level 1 predictor with a random slope. Other predictors can be included at Level 1, but they won’t have the option of a random-slope component.
  • Distribution of the covariates (‘Distribution’). Offers 3 options: normally-distributed, skewed (i.e. χ² with 1 degree of freedom, or a skewness of about √8) and binary/Bernoulli-distributed. For the binary predictor the user can change the population parameter p to create imbalance between the groups. So, for instance, if p=0.3 then 30% of the sample would belong to the group labelled ‘1’ and 70% to the group labelled ‘0’. The default for this option is 0.5, which creates an even 50/50 split.
  • Intercept (‘Intercept Beta’). Lets the user define the intercept of the regression model. The default is 0 and I wouldn’t recommend changing it unless you’re making inferences about the intercept of the regression model.

Covariates3

Once the number of covariates has been selected, the app will offer the user all possible 2-way interaction effects, irrespective of the level of the predictor and its distribution. The user can select whichever 2-way interaction is of interest and assign it an effect size/regression coefficient (i.e. ‘Beta’). The app will use this effect size to calculate power. Notice that the distribution of the interaction is fully defined by the distributions of its constituent main effects.

simulationRuns

The number of datasets generated using the population parameters previously defined by the researcher. The default is 10 but I would personally recommend a minimum of 100. The larger the number of replications the more accurate the results will be but also the longer the simulation will take.

simulationRuns2

The simulated power is calculated as the proportion of statistically significant results out of the number of simulated datasets and will be printed here. Notice the time progress bar indicating that the simulation is still running. For a 2-covariate model with both a random effect for the intercept and the slope the simulation took almost 3 min to run. Expect longer waiting times if the model has lots of covariates.

simulationRuns3

This is what a sample of a full power analysis looks like. The estimated power can be found under the column ‘Power’. The column labelled ‘NA’ shows the proportion of models that did not converge. In this case, all models converged (there are 0s all throughout the NA column) but the power of the fixed and random effects is relatively low with the exception of the power for the variance of the random intercept. In this example one would need to either increase the effect size from 0.5 to something larger or increase the Level 1 and Level 2 sample sizes in order to obtain acceptable power levels of 80%. You can either download your power analysis results as a .csv file or copy-paste them by clicking on the appropriate button.

Finally, here is the link for the shiny web app:

YOU CAN CLICK HERE TO ACCESS THE APP

 

If you’re an R user and would either like to see the code that runs underneath or would prefer to work directly with it for your simulation, you can check it out on my github account.
If this app is of any use to you and you’d like to cite it, please also cite the lme4, simglm and paramtest R packages. This is really just a shiny wrapper for those 3 packages put together; they do most of the “heavy lifting” when it comes to the simulation and calculations.

Ordinal Alpha and Parallel Analysis

This shiny app will:

– Give you the polychoric (or tetrachoric, in case of binary data) correlation matrix

– Do Parallel Analysis and a scree plot based on the polychoric (or tetrachoric) correlation matrix

– Calculate ordinal alpha as recommended in:

Zumbo, B. D., Gadermann, A. M., & Zeisser, C. (2007). Ordinal Versions of Coefficients Alpha and Theta for Likert Rating Scales. Journal of Modern Applied Statistical Methods, 6, 21-29.

It currently takes in certain SPSS files (.sav extensions from older versions of SPSS, say around 2013 or earlier), Microsoft Excel files (.xls extensions) and comma-delimited files (.csv extensions). If your data are in none of those formats, please convert them before using the app (it’s super easy), or it will give you an error. Also notice that the app will use ALL of the variables in the uploaded file, so make sure you upload a file that only contains the variables (test items, in most cases) you want to correlate/calculate alpha for. You’ll need to provide a clean dataset for it to work: if you have missing values, you’ll need to remove them manually before submitting, and if there are outliers, those need to be dealt with before using the app.

Please notice that, in accordance with research, if you have 8 (or more) Likert response categories, the app will give you an error saying you have enough categories to safely treat your variables as continuous, so you don’t really need this app. You can see why in Rhemtulla, Brosseau-Liard & Savalei (2012).

YOU CAN CLICK HERE TO ACCESS THE APP