StatLab Articles

Distribution-Free Confidence Intervals for Percentiles

Percentiles are order statistics. This means they’re determined by ordering observations from smallest to largest and then finding the value below which some percentage of the data lie. The most common percentile is the median. It’s simply the middle value (or the average of the two middle values if there is an even number of observations). Fifty percent of the data lie below the median. Other percentiles frequently of interest are the 25th and 75th percentiles. These are the data values below which lie 25 and 75 percent of the data, respectively.
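
As a quick illustration, base R’s quantile() function computes percentiles directly. The data below are simulated and purely hypothetical:

    # simulate 100 observations from a skewed distribution
    set.seed(1)
    x <- rexp(100, rate = 0.5)

    # the 25th, 50th (median), and 75th percentiles
    quantile(x, probs = c(0.25, 0.50, 0.75))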

R, statistical methods, confidence intervals, bootstrap, Clay Ford

Getting Started with Multiple Imputation for Longitudinal Data

Multiple Imputation (MI) is a method for dealing with missing data in a statistical analysis. The general idea of MI is to simulate values for missing data points using the data we have on hand, generating multiple new sets of complete data. We then run our proposed analysis on all the complete data sets and combine the results to obtain overall estimates. The end product is an analysis with proper standard errors and unbiased estimates.
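
A minimal sketch of this workflow, assuming the mice package and a hypothetical data frame dat with missing values:

    library(mice)
    # dat is a hypothetical data frame with missingness in x and/or y
    imp <- mice(dat, m = 5, printFlag = FALSE)  # create 5 completed data sets
    fit <- with(imp, lm(y ~ x))                 # run the analysis on each one
    summary(pool(fit))                          # pool results via Rubin's rules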

multiple imputation, simulation, mixed effect models, R, statistical methods, Clay Ford

Addressing Multicollinearity

When a linear model has two or more highly correlated predictor variables, it is often said to suffer from multicollinearity. The danger of multicollinearity is that estimated regression coefficients can be highly uncertain and possibly nonsensical (e.g., getting a negative coefficient that common sense dictates should be positive). Multicollinearity is usually detected using variance inflation factors (VIF).
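
For example, VIFs can be computed with the vif() function from the car package. The model and variable names below are hypothetical:

    library(car)
    # hypothetical linear model with three predictors in a data frame dat
    m <- lm(y ~ x1 + x2 + x3, data = dat)
    vif(m)  # values well above 5 or 10 are a common rule-of-thumb warning sign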

R, statistical methods, multicollinearity, ridge regression, PCA, Clay Ford

Correlation: Pearson, Spearman, and Kendall's tau

Correlation is a widely used method that helps us explore how two variables change together, providing insight into whether a relationship exists between them. For example, imagine we want to understand if there is an association between time spent studying and exam scores. Or, maybe we think that people who eat more cookies are happier. Or, we want to see if people who live near a park hear more birds singing in the morning. Correlation is a valuable tool for understanding the extent to which variables are associated.
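
All three flavors of correlation named in the title are available through base R’s cor() function. A quick sketch with made-up study-time data:

    # hypothetical data: hours studied and exam scores
    hours  <- c(2, 4, 5, 7, 8, 10)
    scores <- c(60, 65, 70, 80, 85, 95)

    cor(hours, scores, method = "pearson")
    cor(hours, scores, method = "spearman")
    cor(hours, scores, method = "kendall")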

R, correlation, statistical methods, spearman correlation, kendall tau, Lauren Brideau

Testing for Significance with Permutation-based Methods

When we perform statistical tests, we often want to obtain a p-value, which describes the probability of obtaining test results at least as extreme as the observed result, assuming that the null hypothesis is true. In other words, how likely would it be to observe an effect as large as (or larger than) the observed effect by chance alone if the null is true? Common statistical approaches such as t-tests, ANOVAs, and linear regression make assumptions about the data or the errors.
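
One appeal of permutation-based methods is that they rely on fewer such assumptions. A minimal sketch of a two-group permutation test for a difference in means, using simulated (not real) data:

    set.seed(1)
    # hypothetical outcomes for two groups of 20
    g1 <- rnorm(20, mean = 5)
    g2 <- rnorm(20, mean = 6)
    obs <- mean(g2) - mean(g1)       # observed difference in means

    pooled <- c(g1, g2)
    perm_diffs <- replicate(5000, {
      s <- sample(pooled)            # shuffle, i.e., permute group labels
      mean(s[21:40]) - mean(s[1:20])
    })

    # two-sided p-value: share of permuted differences at least as extreme
    mean(abs(perm_diffs) >= abs(obs))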

R, permutation, statistical methods, Ethan Kadiyala

Getting Started with Tweedie Models

Tweedie models are a special type of generalized linear model (GLM) that can be useful when we want to model an outcome that sometimes equals 0 but is otherwise positive and continuous. Some examples include daily precipitation data and annual income. Data like this can have zeroes, often lots of zeroes, in addition to positive values. When modeling data of this nature, we may want to ensure our model does not predict negative values. We may also want to log-transform the data without dropping the zeroes. Tweedie models allow us to do both.
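
One way to fit such a model is glm() combined with the tweedie() family from the statmod package. In this sketch the data are simulated: a var.power between 1 and 2 corresponds to the compound Poisson-gamma case, which allows exact zeroes, and link.power = 0 requests a log link:

    library(statmod)
    set.seed(1)
    # hypothetical outcome: non-negative and continuous, with many exact zeroes
    x <- runif(200)
    y <- ifelse(runif(200) < 0.3, 0, rgamma(200, shape = 2, rate = 1 / exp(x)))

    m <- glm(y ~ x, family = tweedie(var.power = 1.5, link.power = 0))
    summary(m)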

R, statistical methods, tweedie, simulation, zero-inflated models, Clay Ford

Getting Started with Multilevel Regression and Poststratification

Multilevel Regression and Poststratification (MRP) is a method of adjusting model estimates for non-response. By “non-response” we mean under-sampled groups in a population. For example, imagine conducting a phone survey to estimate the percentage of a population that approves of an elected official. It’s likely that certain age groups in the population will be under-sampled because they’re less likely to answer a call from an unfamiliar number. MRP allows us to analyze the data and adjust the estimate by taking the under-sampled groups into account.
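
Compressed to its two steps, the approach might look like this sketch, which assumes the lme4 package, a hypothetical survey data frame svy, and a poststratification table ps containing known population counts N for each age group:

    library(lme4)
    # step 1: multilevel model of approval with a random intercept per age group
    m <- glmer(approve ~ (1 | age_group), data = svy, family = binomial)

    # step 2: predict approval for every cell of the poststratification table,
    # then weight the predictions by known population counts
    ps$pred <- predict(m, newdata = ps, type = "response")
    weighted.mean(ps$pred, w = ps$N)  # population-adjusted estimate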

R, statistical methods, mixed effect models, Bayesian methods, simulation, Clay Ford

Using Wavelets to Analyze Time Series Data

Time series data can contain a lot of information. Often, it is difficult to visually detect patterns in a time series, and even harder to quantify them. How can we analyze our time series data to understand its underlying signals and how these signals are changing through time?

R, time series analysis, statistical methods, wavelets, Ethan Kadiyala

Understanding t-tests, ANOVA, and MANOVA

Imagine you love baking cookies and invite your friends over for a cookie party. You want to know how many cookies you should make, so you ask your friends how many cookies they each think they will eat. They respond:

  • Francesca: 5 cookies
  • Sydney: 3 cookies
  • Noelle: 1 cookie
  • James: 7 cookies
  • Brooke: 2 cookies

We take these numbers and add all of them together to estimate that about 18 cookies will be eaten in total at our party.

\[ 5 + 3 + 1 + 7 + 2 = 18 \text{ cookies total} \]
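
The same arithmetic in R, with the guests stored as a named vector:

    cookies <- c(Francesca = 5, Sydney = 3, Noelle = 1, James = 7, Brooke = 2)
    sum(cookies)   # 18 cookies in total
    mean(cookies)  # 3.6 cookies per guest, on average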

R, ANOVA, MANOVA, t-test, statistical methods, Lauren Brideau

Understanding Polychoric Correlation

Polychoric correlation is a measure of association between two ordered categorical variables, each assumed to represent latent continuous variables that have a bivariate standard normal distribution. When we say two variables have a bivariate standard normal distribution, we mean they’re both normally distributed with mean 0 and standard deviation 1, and that they are linearly correlated.
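
As a quick sketch, the polychor() function from the polycor package estimates this latent correlation. Here the data are simulated from a bivariate standard normal with correlation 0.6 and then cut into ordered categories at arbitrary thresholds:

    library(polycor)
    set.seed(1)
    # latent bivariate standard normal data with correlation 0.6
    z <- MASS::mvrnorm(500, mu = c(0, 0),
                       Sigma = matrix(c(1, 0.6, 0.6, 1), nrow = 2))

    # observed ordinal variables: 4 ordered categories each
    x <- cut(z[, 1], breaks = c(-Inf, -1, 0, 1, Inf), labels = FALSE)
    y <- cut(z[, 2], breaks = c(-Inf, -1, 0, 1, Inf), labels = FALSE)

    polychor(x, y)  # should recover roughly 0.6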

R, simulation, statistical methods, polychoric correlation, maximum likelihood, Clay Ford