Clay Ford

Understanding ICC

The intraclass correlation coefficient, or ICC, summarizes the relative importance of random effect groups in mixed-effect/multilevel models. To be more precise, it quantifies the proportion of variability in the outcome that is due to variance between random effect groups. If you’re not sure what any of this means, this article is for you.
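As a minimal sketch, assuming a random-intercept model with a single grouping factor, the ICC can be written as the between-group variance divided by the total variance. The symbols $\sigma^2_{b}$ (between-group variance) and $\sigma^2_{w}$ (within-group, or residual, variance) are our own notation for illustration:

$$ \text{ICC} = \frac{\sigma^2_{b}}{\sigma^2_{b} + \sigma^2_{w}} $$

An ICC near 0 suggests the groups explain little of the variability, while an ICC near 1 suggests most of the variability lies between groups.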

Mixed Effect versus Fixed Effect Models

When faced with analyzing clustered or repeated measures data, some researchers and analysts turn to mixed effect modeling. Yet others when faced with the same situation turn to fixed effect modeling. Which one you choose is usually dictated by your field of study and statistical education. Those coming from fields like Psychology, Ecology, and Education often choose mixed effect modeling, while those coming from fields like Economics and Political Science typically choose fixed effect modeling.

Distribution-Free Confidence Intervals for Percentiles

Percentiles are order statistics. This means they’re determined by ordering observations from smallest to largest and then finding the value below which some percentage of the data lie. The most common percentile is the median. It’s simply the middle value (or the average of the two middle values if there is an even number of observations). Fifty percent of the data lie below the median. Other percentiles frequently of interest are the 25th and 75th percentiles. These are the data values below which lie 25 and 75 percent of the data, respectively.
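To make the idea concrete, here is a small sketch in base R. The ten values are made up for illustration and do not come from the article. Note that quantile() supports several definitions of a sample percentile through its type argument, so other software may return slightly different quartiles.

```r
# Hypothetical data: 10 made-up observations (an even number)
x <- c(12, 5, 9, 21, 7, 15, 3, 18, 10, 14)

# Order the observations from smallest to largest
sorted_x <- sort(x)
n <- length(sorted_x)

# With an even n, the median is the average of the two middle values
med_by_hand <- mean(sorted_x[c(n / 2, n / 2 + 1)])

# The built-in function gives the same answer
med_builtin <- median(x)

# 25th and 75th percentiles using R's default definition (type = 7)
quartiles <- quantile(x, probs = c(0.25, 0.75))
```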

Getting Started with Multiple Imputation for Longitudinal Data

Multiple Imputation (MI) is a method for dealing with missing data in a statistical analysis. The general idea of MI is to simulate values for missing data points using the data we have on hand, generating multiple new sets of complete data. We then run our proposed analysis on all the complete data sets and combine the results to obtain overall estimates. The end product is an analysis with proper standard errors and unbiased estimates.

Addressing Multicollinearity

When a linear model has two or more highly correlated predictor variables, it is often said to suffer from multicollinearity. The danger of multicollinearity is that estimated regression coefficients can be highly uncertain and possibly nonsensical (e.g., getting a negative coefficient that common sense dictates should be positive). Multicollinearity is usually detected using variance inflation factors (VIF).

Getting Started with Tweedie Models

Tweedie models are a special type of Generalized Linear Model (GLM) that can be useful when we want to model an outcome that sometimes equals 0 but is otherwise positive and continuous. Some examples include daily precipitation data and annual income. Data like this can have zeroes, often lots of zeroes, in addition to positive values. When modeling data of this nature, we may want to ensure our model does not predict negative values. We may also want to log-transform this data without dropping the zeroes. Tweedie models allow us to do both.

Getting Started with Multilevel Regression and Poststratification

Multilevel Regression and Poststratification (MRP) is a method of adjusting model estimates for non-response. By “non-response” we mean under-sampled groups in a population. For example, imagine conducting a phone survey to estimate the percentage of a population that approves of an elected official. It’s likely that certain age groups in the population will be under-sampled because they’re less likely to answer a call from an unfamiliar number. MRP allows us to analyze the data and adjust the estimate by taking the under-sampled groups into account.

Understanding Polychoric Correlation

Polychoric correlation is a measure of association between two ordered categorical variables, each assumed to represent latent continuous variables that have a bivariate standard normal distribution. When we say two variables have a bivariate standard normal distribution, we mean they’re both normally distributed with mean 0 and standard deviation 1, and that they have linear correlation.

Power and Sample Size Calculations for Ordered Categorical Data

In this article we demonstrate how to calculate the sample size needed to achieve a desired power for experiments with an ordered categorical outcome. We assume the data will be analyzed with a proportional odds logistic regression model. We’ll use the R statistical computing environment and functions from the {Hmisc} package to implement the calculations.

Assessing Model Assumptions with Lineup Plots

When fitting a linear model we make two assumptions about the distribution of residuals:

Power and Sample Size Estimation for Logistic Regression

In this article we demonstrate how to use simulation in R to estimate power and sample size for proposed logistic regression models that feature two binary predictors and their interaction.

Recall that logistic regression attempts to model the probability of an event conditional on the values of predictor variables. If we have a binary response, y, and two predictors, x and z, that interact, we specify the logistic regression model as follows:
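One common way to write such a model is sketched below in our own notation, where $\beta_0, \dots, \beta_3$ are assumed coefficient names and $\beta_3$ is the coefficient for the interaction:

$$ \log\left(\frac{P(y = 1)}{1 - P(y = 1)}\right) = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 xz $$

The left-hand side is the log odds of the event, so each coefficient describes a change on the log-odds scale.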

Understanding Somers' D

When it comes to summarizing the association between two numeric variables, we can use Pearson or Spearman correlation. When accompanied by a scatterplot, they allow us to quantify association on a scale from -1 to 1. But what if we have two ordered categorical variables with just a few levels? How can we summarize their association? One approach is to calculate Somers’ Delta, or Somers’ D for short.

Getting Started with Analysis of Covariance

The Analysis of Covariance, or ANCOVA, is a regression model that includes both categorical and numeric predictors, often just one of each. It is commonly used to analyze a follow-up numeric response after exposure to various treatments, controlling for a baseline measure of that same response. For example, given two subjects with the same baseline value of the study outcome, one in a treated group and the other in a control group, will the subjects have different follow-up outcomes on average?

Getting Started with Simple Slopes Analysis

A Simple Slopes Analysis is a follow-up procedure to regression modeling that helps us investigate and interpret “significant” interactions. The analysis is often employed for interactions between two numeric predictors, but it can be applied to other types of interactions as well. To motivate why we might be interested in this type of analysis, consider the following research question:

Do the length of time in a managerial position (X) and a manager’s ability (Z) help explain or predict a manager’s self-assurance (Y)?

Simulating Multinomial Logistic Regression Data

In this article we demonstrate how to simulate data suitable for a multinomial logistic regression model using R. One reason to do this is to gain a better understanding of how multinomial logistic regression models work. Another is to simulate data for the purposes of estimating power and sample size for a planned experiment that will involve a multinomial logistic regression analysis.

Understanding Precision-Based Sample Size Calculations

When designing an experiment it’s good practice to estimate the number of subjects or observations we’ll need. If we recruit or collect too few, our analysis may be too uncertain or misleading. If we collect too many, we potentially waste time and expense on diminishing returns. The optimal sample size provides enough information to allow us to analyze our research questions with confidence. The traditional approach to sample size estimation is based on hypothesis tests.

Understanding Semivariograms

I’ve heard something frightening from practicing statisticians who frequently use mixed effects models. Sometimes when I ask them whether they produced a [semi]variogram to check the correlation structure they reply “what’s that?” -Frank Harrell

Getting Started with Gamma Regression

In this article, we plan to get you up and running with gamma regression. But before we dive into that, let’s review the familiar normal distribution. This will provide some scaffolding to help us transition to the gamma distribution.

Understanding Deviance Residuals

If you have ever performed binary logistic regression in R using the glm() function, you may have noticed a summary of “Deviance Residuals” at the top of the summary output. In this article, we talk about how these residuals are calculated and what we can use them for. We also talk about other types of residuals available for binary logistic regression.

Getting Started with Bootstrap Model Validation

Let’s say we fit a logistic regression model for the purpose of predicting the probability of low infant birth weight, which is an infant weighing less than 2.5 kg. Below we fit such a model using the birthwt data set that comes with the MASS package in R. (This is an example model and not to be used as medical advice.)

We first subset the data to select four variables:

Mathematical Annotation in R

In this article, we demonstrate how to include mathematical symbols and formulas in plots created with R. This can mean adding a formula in the title of the plot, adding symbols to axis labels, annotating a plot with some math, and so on.

Comparing Mixed-Effect Models in R and SPSS

Occasionally we are asked to help students or faculty implement a mixed-effect model in SPSS. Our training and expertise are primarily in R, so it can be challenging to transfer and apply our knowledge to SPSS. In this article we document for posterity how to fit some basic mixed-effect models in R using the lme4 and nlme packages, and how to replicate the results in SPSS.

In this article we work with R 4.2.0, lme4 version 1.1-29, nlme version 3.1-157, and SPSS version 28.0.1.1.

Comparing the Accuracy of Two Binary Diagnostic Tests in a Paired Study Design

There are many medical tests for detecting the presence of a disease or condition. Some examples include tests for lesions, cancer, pregnancy, or COVID-19. While these tests are usually accurate, they’re not perfect. In addition, some tests are designed to detect the same condition, but use a different method. Recent examples are PCR and antigen tests for COVID-19. In these cases we might want to compare the two tests on the same subjects. This is known as a paired study design.

Correlation of Fixed Effects in lme4

If you have ever used the R package lme4 to perform mixed-effect modeling you may have noticed the “Correlation of Fixed Effects” section at the bottom of the summary output. This article intends to shed some light on what this section means and how you might interpret it.

A Beginner’s Guide to Marginal Effects

What are average marginal effects? If we unpack the phrase, it looks like we have effects that are marginal to something, all of which we average. So let’s look at each piece of this phrase and see if we can help you get a better handle on this topic.

Power and Sample Size Analysis Using Simulation

The power of a test is the probability of correctly rejecting a null hypothesis. For example, let’s say we suspect a coin is not fair and lands heads 65% of the time.
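As a rough sketch of how simulation can estimate power for the coin example, the code below repeatedly flips a biased coin and counts how often an exact binomial test rejects the null hypothesis of a fair coin. The number of flips (50), the significance level (0.05), and the number of simulations (2000) are our own assumptions for illustration, not values from the article.

```r
set.seed(1)

n_flips <- 50     # assumed sample size (hypothetical)
p_true  <- 0.65   # the suspected probability of heads
alpha   <- 0.05   # assumed significance level

# Simulate many experiments and store the p-value from each test
p_values <- replicate(2000, {
  heads <- rbinom(1, size = n_flips, prob = p_true)
  binom.test(heads, n_flips, p = 0.5)$p.value
})

# Estimated power: proportion of experiments that reject the null
power_est <- mean(p_values < alpha)
power_est
```

Increasing n_flips and rerunning the code shows how power grows with sample size.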

Post Hoc Power Calculations Are Not Useful

It is well documented that post hoc power calculations are not useful (Althouse, 2020; Goodman & Berlin, 1994; Hoenig & Heisey, 2001). Also known as observed power or retrospective power, post hoc power purports to estimate the power of a test given an observed effect size. The idea is to show that a “non-significant” hypothesis test failed to achieve significance because it wasn’t powerful enough. This allows researchers to entertain the notion that their hypothesized effect may actually exist; they just needed to use a bigger sample size.

Understanding Ordered Factors in a Linear Model

Consider the following data from the text Design and Analysis of Experiments, 7th ed. (Montgomery, 2009, Table 3.1). It has two variables: power and rate. power is a discrete setting on a tool used to etch circuits into a silicon wafer. There are four levels to choose from. rate is the distance etched measured in Angstroms per minute. (An Angstrom is one ten-billionth of a meter.) Of interest is how (or if) the power setting affects the etch rate.

Getting Started with Generalized Estimating Equations

Generalized estimating equations, or GEE, is a method for modeling longitudinal or clustered data. It is usually used with non-normal data such as binary or count data. The name refers to a set of equations that are solved to obtain parameter estimates (i.e., model coefficients). If interested, see Agresti (2002) for the computational details. In this article we simply aim to get you started with implementing and interpreting GEE using the R statistical computing environment.

Getting Started with Binomial Generalized Linear Mixed Models

Binomial generalized linear mixed models, or binomial GLMMs, are useful for modeling binary outcomes for repeated or clustered measures. For example, let’s say we design a study that tracks what college students eat over the course of 2 weeks, and we’re interested in whether or not they eat vegetables each day. For each student, we’ll have 14 binary events: eat vegetables or not.

Understanding Multiple Comparisons and Simultaneous Inference

When it comes to confidence intervals and hypothesis testing there are two important limitations to keep in mind.

The significance level, $\alpha$, or the confidence interval coverage, $1 - \alpha$,

(1) applies only to one test or estimate, not to a series of tests or estimates, and
(2) is appropriate only if the estimate or test was not suggested by the data.

Let’s illustrate both of these limitations via simulation using R.
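Here is a minimal sketch of the first limitation, under assumptions of our own choosing: 20 one-sample t-tests per study, 30 observations per test, and every null hypothesis true. If $\alpha = 0.05$ applied to the whole series, we would see at least one false positive in about 5% of studies. The simulation suggests it happens far more often.

```r
set.seed(2)

n_tests <- 20   # assumed number of tests per study (hypothetical)

# For each simulated study, run n_tests t-tests on pure noise
# and record whether any of them is "significant"
any_false_positive <- replicate(1000, {
  p <- replicate(n_tests, t.test(rnorm(30))$p.value)
  any(p < 0.05)
})

# Simulated chance of at least one false positive in a study
fwer <- mean(any_false_positive)

# Theoretical value for independent tests
expected <- 1 - (1 - 0.05)^n_tests

c(simulated = fwer, expected = expected)
```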

Understanding Robust Standard Errors

What are robust standard errors? How do we calculate them? Why use them? Why not use them all the time if they’re so robust? Those are the kinds of questions this post intends to address.

Getting Started with Multinomial Logit Models

Multinomial logit models allow us to model membership in a group based on known variables. For example, the operating system preferences of a university’s students could be classified as “Windows,” “Mac,” or “Linux.” Perhaps we would like to better understand why students choose one OS versus another. We might want to build a statistical model that allows us to predict the probability of selecting an OS based on information such as sex, major, financial aid, and so on. Multinomial logit modeling allows us to propose and fit such models.

Understanding Empirical Cumulative Distribution Functions

What are empirical cumulative distribution functions and what can we do with them? To answer the first question, let’s first step back and make sure we understand "distributions", or more specifically, "probability distributions".

A Basic Probability Distribution

Imagine a simple event, say flipping a coin 3 times. Here are all the possible outcomes, where H = heads and T = tails:
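One way to list these outcomes in R is with expand.grid(), as sketched below. The column names flip1, flip2, and flip3 are our own. Counting the heads in each outcome then gives the probability distribution for the number of heads, assuming a fair coin so that all outcomes are equally likely.

```r
# All possible outcomes of flipping a coin 3 times
flips <- expand.grid(flip1 = c("H", "T"),
                     flip2 = c("H", "T"),
                     flip3 = c("H", "T"),
                     stringsAsFactors = FALSE)

# Number of heads in each outcome
flips$heads <- rowSums(flips[, 1:3] == "H")

# Probability of 0, 1, 2, or 3 heads when all outcomes are equally likely
prob <- table(flips$heads) / nrow(flips)
prob
```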

Getting Started with Rate Models

Let’s say we’re interested in modeling the number of auto accidents that occur at various intersections within a city. Upon collecting data after a certain period of time, perhaps we notice two intersections have the same number of accidents, say 25. Is it correct to conclude these two intersections are similar in their propensity for auto accidents?

Modeling Non-Constant Variance

One of the basic assumptions of linear modeling is constant, or homogeneous, variance. What does that mean exactly? Let’s simulate some data that satisfies this condition to illustrate the concept.

Below we create a sorted vector of numbers ranging from 1 to 10 called x, and then create a vector of numbers called y that is a function of x. When we plot x vs y, we get a straight line with an intercept of 1.2 and a slope of 2.1.
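A sketch of what this could look like is below. The vector length (100), the seed, and the noise standard deviation (2) are our own assumptions. The second half adds normally distributed noise with the same standard deviation at every value of x, which is what constant variance means.

```r
set.seed(3)

# A sorted vector of numbers ranging from 1 to 10
x <- seq(1, 10, length.out = 100)

# y is an exact linear function of x: intercept 1.2, slope 2.1
y <- 1.2 + 2.1 * x

# Add noise with constant variance (sd = 2 is an assumed value)
noise <- rnorm(length(x), mean = 0, sd = 2)
y_noisy <- y + noise

# A fitted line should recover something close to 1.2 and 2.1
fit <- lm(y_noisy ~ x)
coef(fit)
```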

Creating a SQLite Database for Use with R

When you import or load data into R, the data are stored in Random Access Memory (RAM). This is the memory that is deleted when you close R or shut off your computer. It’s very fast but temporary. If you save your data, it is saved to your hard drive. But when you open R again and load the data, once again it is loaded into RAM. While many newer computers come with lots of RAM (such as 16 GB), it’s not an infinite amount. When you open RStudio, you’re using RAM even if no data is loaded. Open a web browser or any other program and they too are loaded into RAM.

Simulating Data for Count Models

A count model is a linear model where the dependent variable is a count. For example, the number of times a car breaks down, the number of rats in a litter, the number of times a young student gets out of his seat, etc. Counts are either 0 or a positive whole number, which means we need to use special distributions to generate the data.
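As a minimal sketch, we can generate counts from a Poisson distribution whose mean depends on a predictor through a log link. The sample size, the predictor x, and the coefficients 0.5 and 1.0 are hypothetical values chosen for illustration.

```r
set.seed(4)

n <- 500
x <- runif(n)                  # hypothetical predictor

# The log of the mean count is a linear function of x
lambda <- exp(0.5 + 1.0 * x)   # assumed coefficients

# Draw counts: 0 or a positive whole number
y <- rpois(n, lambda)

# A Poisson model fit to the data should roughly recover the coefficients
fit <- glm(y ~ x, family = poisson)
coef(fit)
```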

Simulating a Logistic Regression Model

Logistic regression is a method for modeling binary data as a function of other variables. For example we might want to model the occurrence or non-occurrence of a disease given predictors such as age, race, weight, etc. The result is a model that returns a predicted probability of occurrence (or non-occurrence, depending on how we set up our data) given certain values of our predictors. We might also be able to interpret the coefficients in our model to summarize how a change in one predictor affects the odds of occurrence.
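The sketch below simulates binary data of this kind, assuming a single hypothetical predictor (age) and made-up coefficients of -3 and 0.06 on the log-odds scale. Exponentiating the fitted age coefficient gives an estimated odds ratio for a one-year increase in age.

```r
set.seed(5)

n <- 1000
age <- runif(n, min = 20, max = 70)   # hypothetical predictor

# Convert log odds to probabilities with the inverse logit
p <- plogis(-3 + 0.06 * age)          # assumed coefficients

# Draw 0/1 outcomes with those probabilities
y <- rbinom(n, size = 1, prob = p)

# Fit the model and look at the estimated odds ratio for age
fit <- glm(y ~ age, family = binomial)
exp(coef(fit)["age"])
```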

Assessing Type S and Type M Errors

The paper Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors by Andrew Gelman and John Carlin introduces the idea of performing design calculations to help prevent researchers from being misled by statistically significant results in studies with small samples and/or noisy measurements.

Interpreting Log Transformations in a Linear Model

Log transformations are often recommended for skewed data, such as monetary measures or certain biological and demographic measures. Log transforming data usually has the effect of spreading out clumps of data and bringing together spread-out data. For example, below is a histogram of the areas of all 50 US states. It is skewed to the right due to Alaska, California, Texas and a few others.

Getting Started with Matching Methods

Note: This article demonstrates how to use propensity scores for matching data. However, propensity scores have come under fire in recent years. In their 2019 article, Why Propensity Scores Should Not Be Used for Matching, King and Nielsen argue that propensity scores increase imbalance, inefficiency, model dependence, and bias.

Getting Started with Moderated Mediation

In a previous post we demonstrated how to perform a basic mediation analysis. In this post we look at performing a moderated mediation analysis. The basic idea is that a mediator may depend on another variable called a "moderator". For example, in our mediation analysis post we hypothesized that self-esteem was a mediator of the effect of student grades on student happiness. We illustrate this below with a path diagram.

Getting started with Multivariate Multiple Regression

Multivariate Multiple Regression is a method of modeling multiple responses, or dependent variables, with a single set of predictor variables. For example, we might want to model both math and reading SAT scores as a function of gender, race, parent income, and so forth. This allows us to evaluate the relationship of, say, gender with each score. You may be thinking, "why not just run separate regressions for each dependent variable?" That's actually a good idea! And in fact that's pretty much what multivariate multiple regression does.

Visualizing the Effects of Proportional-Odds Logistic Regression

Proportional-odds logistic regression is often used to model an ordered categorical response. By "ordered", we mean categories that have a natural ordering, such as "Disagree", "Neutral", "Agree", or "Everyday", "Some days", "Rarely", "Never". For a primer on proportional-odds logistic regression, see our post, Fitting and Interpreting a Proportional Odds Model.

Getting Started with the purrr Package in R

If you're wondering what exactly the purrr package does, then this blog post is for you.

Working with Dates and Times in R Using the lubridate Package

Sometimes we have data with dates and/or times that we want to manipulate or summarize. A common example in the health sciences is time-in-study. A subject may enter a study on February 12, 2008, and exit on November 4, 2009. How many days was the person in the study? (Don’t forget 2008 was a leap year; February had 29 days.) What was the median time-in-study for all subjects?
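For the single subject above, a quick sketch in base R (rather than lubridate) is to convert both dates with as.Date() and subtract. The object names enter and exit are our own.

```r
# Entry and exit dates for one subject
enter <- as.Date("2008-02-12")
exit  <- as.Date("2009-11-04")

# Subtracting Date objects gives a time difference in days
days_in_study <- as.numeric(exit - enter)
days_in_study
# 631
```

The leap day in February 2008 is handled automatically.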

The Wilcoxon Rank Sum Test

The Wilcoxon Rank Sum Test is often described as the non-parametric version of the two-sample t-test. You sometimes see it in analysis flowcharts after a question such as "is your data normal?" A "no" branch off this question will recommend a Wilcoxon test if you're comparing two groups of continuous measures.

So what is this Wilcoxon test? What makes it non-parametric? What does that even mean? And how do we implement it and interpret it? Those are some of the questions we aim to address in this post.

Pairwise comparisons of proportions

Pairwise comparison means comparing all pairs of something. If I have three items, A, B and C, that means comparing A to B, A to C, and B to C. Given n items, I can determine the number of possible pairs using the binomial coefficient: $$ \frac{n!}{2!(n - 2)!} = \binom {n}{2}$$ Using the R statistical computing environment, we can use the choose() function to quickly calculate this.
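For the three-item example, a short sketch looks like this. The combn() call is an addition of ours that lists the pairs themselves, one pair per column.

```r
# Number of possible pairs among 3 items
n_pairs <- choose(3, 2)
n_pairs
# 3

# List the pairs: A-B, A-C, B-C
pairs <- combn(c("A", "B", "C"), 2)
pairs
```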

Getting Started with Exploratory Factor Analysis

Take a look at the following correlation matrix for Olympic decathlon data calculated from 280 scores from 1960 through 2004 (Johnson & Wichern, 2007, p. 499):

An Introduction to Loglinear Models

Loglinear models model cell counts in contingency tables. They're a little different from other modeling methods in that they don't distinguish between response and explanatory variables. All variables in a loglinear model are essentially "responses."

To learn more about loglinear models, we'll explore the following data from Agresti (1996, Table 6.3). It summarizes responses from a survey that asked high school seniors in a particular city whether they had ever used alcohol, cigarettes, or marijuana.

Setting up Color Palettes in R

Plotting with color in R is kind of like painting a room in your house: You have to pick some colors. R has some default colors ready to go, but it's only natural to want to play around and try some different combinations. In this article, we'll look at some ways you can define new color palettes for plotting in R.

To begin, let's use the palette() function to see what colors are currently available:

Getting Started with Hurdle Models

Hurdle Models are a class of models for count data that help handle excess zeros and overdispersion. To motivate their use, let's look at some data in R. The following data come with the AER package. It is a sample of 4,406 individuals, aged 66 and over, who were covered by Medicare in 1988. One of the variables the data provide is number of physician office visits.

Getting Started with Negative Binomial Regression Modeling

When it comes to modeling counts (i.e., whole numbers greater than or equal to 0), we often start with Poisson regression. This is a generalized linear model where a response is assumed to have a Poisson distribution conditional on a weighted sum of predictors. For example, we might model the number of documented concussions to NFL quarterbacks as a function of snaps played and the total years of experience of their offensive line. However, one potential drawback of Poisson regression is that it may not accurately describe the variability of the counts.

Visualizing the Effects of Logistic Regression

Logistic regression is a popular and effective way of modeling a binary response. For example, we might wonder what influences a person to volunteer, or not volunteer, for psychological research. Some do, some don’t. Are there independent variables that would help explain or distinguish between those who volunteer and those who don’t? Logistic regression gives us a mathematical model that we can use to estimate the probability of someone volunteering given certain independent variables.

Reading PDF Files into R for Text Mining

Let's say we're interested in text mining the opinions of the Supreme Court of the United States. At the time of this writing, the opinions are published as PDF files at the following web page in the section titled "Opinions of the Court": https://www.supremecourt.gov/opinions/opinions.aspx. For the purposes of this introductory tutorial, we'll look at just three opinions from the 2014 term: (1) Glossip v. Gross, (2) State Legislature v.

Understanding Two-Way Interactions

When doing linear modeling or ANOVA it’s useful to examine whether or not the effect of one variable depends on the level of one or more other variables. If it does then we have what is called an “interaction”. This means variables combine or interact to affect the response. The simplest type of interaction is the interaction between two two-level categorical variables. Let’s say we have gender (male and female), treatment (yes or no), and a continuous response measure. If the response to treatment depends on gender, then we have an interaction.

Comparing Proportions with Relative Risk and Odds Ratios

The classic two-by-two table displays counts of what may be called “successes” and “failures” versus some two-level grouping variable, such as sex (male and female) or treatment (placebo and active drug). An example of one such table is given in Agresti (1996, p. 20). The table classifies myocardial infarction (Yes/No) with treatment group (Placebo/Aspirin). The data were “taken from a report on the relationship between aspirin use and myocardial infarction (heart attacks) by the Physicians’ Health Study Research Group at Harvard Medical School.”

Is R-squared Useless?

On Thursday, October 15, 2015, a disbelieving student posted on Reddit: My stats professor just went on a rant about how R-squared values are essentially useless, is there any truth to this? It attracted a fair amount of attention, at least compared to other posts about statistics on Reddit.

Fitting and Interpreting a Proportional Odds Model

The following table shows a cross tabulation of data taken from the 1991 General Social Survey that relates political party affiliation to political ideology (Agresti 1996).

Getting Started with Quantile Regression

When we think of regression, we usually think of linear regression, the tried and true method for estimating a mean of some variable conditional on the levels or values of independent variables. In other words, we’re pretty sure the mean of our variable of interest differs depending on other variables. For example, the mean weight of 1st-year UVA males is some unknown value. But we could in theory take a random sample and discover there is a relationship between weight and height.

Simulating Endogeneity

First off, what is endogeneity, and why would we want to simulate it?

Endogeneity occurs when a statistical model has an independent variable that is correlated with the error term. The reason we would want to simulate it is to understand what exactly that definition means.

Let’s first simulate ideal data for simple linear regression using R.
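A sketch of the contrast is below, with sample size and coefficients of our own choosing. In the ideal data, x is generated independently of the error. In the endogenous data, x and the error share a common component u, so they are correlated and the estimated slope drifts away from its true value of 2.

```r
set.seed(6)
n <- 1000

# Ideal data: x is independent of the error term
x <- rnorm(n)
e <- rnorm(n)
y <- 1 + 2 * x + e              # assumed intercept 1 and slope 2
b_ideal <- coef(lm(y ~ x))[["x"]]

# Endogenous data: x and the error both contain u
u <- rnorm(n)
x_endo <- rnorm(n) + u
e_endo <- rnorm(n) + u
y_endo <- 1 + 2 * x_endo + e_endo
b_endo <- coef(lm(y_endo ~ x_endo))[["x_endo"]]

# The first estimate should be close to 2; the second is biased upward
c(ideal = b_ideal, endogenous = b_endo)
```

In this setup the bias equals the covariance between x and the error divided by the variance of x, which is 1/2, so the endogenous slope lands near 2.5.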

Understanding QQ Plots

The QQ plot, or quantile-quantile plot, is a graphical tool to help us assess if a set of data plausibly came from some theoretical distribution such as a normal or exponential. For example, if we run a statistical analysis that assumes our residuals are normally distributed, we can use a normal QQ plot to check that assumption. It’s just a visual check, not an air-tight proof, so it is somewhat subjective.

Stata Tip: Name Your Graphs

An important component of data analysis is graphing. Stata provides excellent graphics facilities for quickly exploring and visualizing your data. For example, let's load the auto data set that comes with Stata (1978 Automobile Data) and make two scatterplots and then two boxplots:

A Rule of Thumb for Unequal Variances

One of the assumptions of the Analysis of Variance (ANOVA) is constant variance. That is, the spread of residuals is roughly equal per treatment level. A common way to assess this assumption is plotting residuals versus fitted values. Recall that residuals are the observed values of your response of interest minus the predicted values of your response. In a one-way ANOVA, this is simply the observed values minus the group mean. For example, below we have a plot of residuals versus fitted values for a one-way ANOVA.
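To see that the residuals from a one-way ANOVA really are the observed values minus the group means, here is a small sketch using simulated data. The three groups, their means, and the standard deviation are hypothetical.

```r
set.seed(7)

# Hypothetical data: 3 treatment levels, 10 observations each
group <- factor(rep(c("A", "B", "C"), each = 10))
y <- rnorm(30, mean = c(10, 12, 15)[group], sd = 2)

# Fit the one-way ANOVA as a linear model
fit <- lm(y ~ group)

# Group mean for each observation
group_means <- ave(y, group)

# Residuals equal observed values minus group means
all.equal(unname(residuals(fit)), y - group_means)
# TRUE
```

Plotting residuals(fit) against fitted(fit) then shows one vertical strip of points per group, and roughly equal spread across strips is what the constant variance assumption looks like.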

Reshaping Data from Wide to Long

When performing data analysis, we often need to “reshape” our data from wide format to long format. A common example of wide data is a data structure with one record per subject and multiple columns for repeated measures. For example:
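As an illustration, the sketch below builds a tiny wide data frame and reshapes it to long format with base R's reshape() function. The data and the column names (id, day1, day2, day, score) are made up for this example.

```r
# Hypothetical wide data: one row per subject, one column per measurement
wide <- data.frame(id   = 1:3,
                   day1 = c(5, 6, 7),
                   day2 = c(8, 9, 10))

# Reshape to long: one row per subject per day
long <- reshape(wide,
                direction = "long",
                varying   = c("day1", "day2"),  # columns to stack
                v.names   = "score",            # name for the stacked values
                timevar   = "day",              # name for the time index
                times     = 1:2,                # values of the time index
                idvar     = "id")
long
```

The long data frame has six rows, with each subject appearing once for each day.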