StatLab Articles

Graphical Linearity Assessment for One- and Two-Predictor Logistic Regressions

Logistic regression is a flexible tool for modeling binary outcomes. A logistic regression describes the log-odds of the probability, \(P\), of 1/“yes”/“success” (versus 0/“no”/“failure”) as a linear combination of predictors:

\[\log\left(\frac{P}{1-P}\right) = B_0 + B_1X_1 + B_2X_2 + \dots + B_kX_k\]
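A minimal sketch of fitting such a model in R on simulated data (the variable names and coefficient values here are illustrative, not drawn from the article):

    # Simulate two predictors and a binary outcome whose log-odds are a
    # linear combination of the predictors
    set.seed(1)
    n <- 500
    x1 <- rnorm(n)
    x2 <- rnorm(n)
    p <- plogis(-0.5 + 1.2 * x1 - 0.8 * x2)  # inverse logit of the linear predictor
    y <- rbinom(n, size = 1, prob = p)

    # Fit the logistic regression; coefficients are on the log-odds scale
    fit <- glm(y ~ x1 + x2, family = binomial)
    coef(fit)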

R, logistic regression, simulation, model assessment, statistical methods, Jacob Goldstein-Greenwood

Why Preallocate Memory in R Loops?

In R, “growing” an object—extending an atomic vector one element at a time; adding elements one by one to the end of a list; etc.—is an easy way to elicit a mild admonishment from someone reviewing or revising your code. Growing most frequently occurs in the context of for loops: A loop computes a value (or set of values) on each iteration, and it then appends the value(s) to an existing object.
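As a hedged illustration of the difference (timings vary by machine; this sketch is not the article's own benchmark):

    n <- 5e4
    # Growing: the vector is copied and reallocated as it lengthens
    grow <- function() {
      out <- c()
      for (i in 1:n) out <- c(out, i^2)
      out
    }
    # Preallocating: memory is requested once up front, then filled in place
    prealloc <- function() {
      out <- numeric(n)
      for (i in 1:n) out[i] <- i^2
      out
    }
    system.time(grow())      # noticeably slower
    system.time(prealloc())  # near-instant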

R, simulation, preallocation, optimization, Jacob Goldstein-Greenwood

Regression to the Mean and Change Score Analysis

Regression to the mean refers to a phenomenon in which natural variation within an individual can mistakenly appear as meaningful change over time. To illustrate, imagine a patient who comes in for a regular check-up and is found to have high blood sugar levels. This may be cause for concern, and the doctor recommends several dietary adjustments and schedules a follow-up for the next week. During the follow-up visit, the patient’s blood sugar levels have seemingly returned to a normal range.
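A small simulation (with made-up numbers, purely for illustration) makes the phenomenon concrete: if each visit's reading is a stable person-level value plus measurement noise, the patients who look worst at the first visit will, on average, look better at the second even though nothing about them has changed.

    # Two noisy measurements of the same underlying blood sugar level
    set.seed(1)
    true_level <- rnorm(1000, mean = 100, sd = 5)  # stable person-level values
    visit1 <- true_level + rnorm(1000, sd = 10)    # measurement noise
    visit2 <- true_level + rnorm(1000, sd = 10)
    # Among patients flagged as "high" at visit 1, the visit-2 average is lower
    high <- visit1 > 120
    mean(visit1[high]); mean(visit2[high])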

statistical methods, R, simulation, Laura Jamison

Simulating Multilevel Data

The term “multilevel data” refers to data organized in a hierarchical structure, where units of analysis are grouped into clusters. For example, in a cross-sectional study, multilevel data could be made up of individual measurements of students from different schools, where students are nested within schools. In a longitudinal study, multilevel data could be made up of multiple time point measurements of individuals, where time points are nested within individuals.
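A minimal sketch of simulating the cross-sectional case in R, assuming illustrative sample sizes and variance components (not the article's values):

    # Students nested within schools: each school gets its own random
    # intercept, and its students inherit that baseline
    set.seed(1)
    n_schools <- 30
    n_students <- 20
    school <- rep(1:n_schools, each = n_students)
    school_effect <- rnorm(n_schools, sd = 2)  # between-school variation
    y <- 50 + school_effect[school] + rnorm(n_schools * n_students, sd = 5)
    dat <- data.frame(school = factor(school), y = y)

    # A mixed-effect model can then recover the two variance components:
    # library(lme4)
    # lmer(y ~ 1 + (1 | school), data = dat)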

mixed effect models, simulation, R, lme4, statistical methods, Laura Jamison

How to Use Docker for Study Reproducibility with R Markdown

Docker is a software product that allows for the efficient building, packaging, and deployment of applications. It uses containers: isolated environments that bundle software together with its dependencies. A container can run an application on any other computer with the same software, dependencies, and settings as on the original machine, all without affecting the host system. In this regard, Docker differs from a virtual machine in that it does not require a guest operating system.
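As a minimal sketch of the idea (the base image, tag, and file names below are assumptions for illustration, not taken from the article), a Dockerfile for rendering an R Markdown report might look like:

    # Start from a versioned rocker image that bundles R and the
    # rmarkdown/pandoc publishing stack
    FROM rocker/verse:4.3.1

    # Copy the analysis into the image so the container is self-contained
    WORKDIR /home/analysis
    COPY report.Rmd .

    # Render the report when the container runs
    CMD ["Rscript", "-e", "rmarkdown::render('report.Rmd')"]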

R, R Markdown, Docker, reproducibility, Laura Jamison

Theil-Sen Regression: Programming and Understanding an Outlier-Resistant Alternative to Least Squares

Least squares is so frequently the method by which linear regressions are estimated that in many write-ups of analyses, explicit mention of the method is omitted. Authors save the ink or pixels otherwise consumed by “least squares” and let it simply be inferred. This is an understandable elision: You could make good money repeatedly betting that when someone says that they fit a linear regression, they did so via least squares. But alternative estimation methods are on offer—and are sometimes preferable.
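Theil-Sen regression is one such alternative: its slope is the median of the slopes between all pairs of points, which makes it resistant to outliers. A rough base-R sketch of the estimator (a from-scratch illustration, not the article's code):

    theil_sen <- function(x, y) {
      pairs <- combn(length(x), 2)          # all index pairs i < j
      slopes <- (y[pairs[2, ]] - y[pairs[1, ]]) / (x[pairs[2, ]] - x[pairs[1, ]])
      slope <- median(slopes, na.rm = TRUE) # na.rm guards against tied x values
      intercept <- median(y - slope * x)
      c(intercept = intercept, slope = slope)
    }

    # An outlier barely moves the Theil-Sen fit but pulls least squares
    set.seed(1)
    x <- 1:20
    y <- 2 + 0.5 * x + rnorm(20); y[20] <- 40
    theil_sen(x, y)
    coef(lm(y ~ x))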

R, simulation, statistical methods, nonparametric statistics, Theil-Sen regression, Jacob Goldstein-Greenwood

Getting Started with Analysis of Covariance

The Analysis of Covariance, or ANCOVA, is a regression model that includes both categorical and numeric predictors, often just one of each. It is commonly used to analyze a follow-up numeric response after exposure to various treatments, controlling for a baseline measure of that same response. For example, given two subjects with the same baseline value of the study outcome, one in a treated group and the other in a control group, will the subjects have different follow-up outcomes on average?
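A minimal sketch of that setup in R (variable names and effect sizes are illustrative):

    # Follow-up outcome modeled as a function of treatment group,
    # adjusting for the baseline measure of the same outcome
    set.seed(1)
    n <- 100
    baseline <- rnorm(n, mean = 50, sd = 10)
    group <- factor(rep(c("control", "treated"), each = n / 2))
    followup <- 5 + 0.8 * baseline + 4 * (group == "treated") + rnorm(n, sd = 5)

    fit <- lm(followup ~ baseline + group)
    summary(fit)  # the group coefficient compares groups at the same baseline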

R, effect plots, power analysis, statistical methods, ANCOVA, ANOVA, Clay Ford

Bootstrap Estimates of Confidence Intervals

Bootstrapping is a statistical procedure that utilizes resampling (with replacement) of a sample to infer properties of a wider population.
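The article works in Python; as a sketch of the core idea in base R (with an arbitrary statistic and sample, for illustration only), a 95% percentile interval for a mean might look like:

    set.seed(1)
    x <- rexp(50, rate = 1/10)             # a skewed sample of size 50
    # Recompute the mean on many resamples drawn with replacement
    boot_means <- replicate(5000, mean(sample(x, replace = TRUE)))
    quantile(boot_means, c(0.025, 0.975))  # 95% percentile bootstrap interval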

Python, statistical methods, confidence intervals, bootstrap, Samantha Lomuscio

Getting Started with Simple Slopes Analysis

A Simple Slopes Analysis is a follow-up procedure to regression modeling that helps us investigate and interpret “significant” interactions. The analysis is often employed for interactions between two numeric predictors, but it can be applied to other types of interactions as well. To motivate why we might be interested in this type of analysis, consider the following research question:

Does the length of time in a managerial position (X) and a manager’s ability (Z) help explain or predict a manager’s self-assurance (Y)?
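A minimal sketch of the arithmetic behind the analysis (simulated data; coefficients are illustrative): with the model \(Y = b_0 + b_1X + b_2Z + b_3XZ\), the slope of X at a chosen value of Z is \(b_1 + b_3Z\).

    set.seed(1)
    n <- 200
    X <- rnorm(n); Z <- rnorm(n)
    Y <- 1 + 0.5 * X + 0.3 * Z + 0.6 * X * Z + rnorm(n)
    fit <- lm(Y ~ X * Z)
    b <- coef(fit)
    # Slope of X at low, mean, and high values of Z (mean +/- 1 SD)
    z_vals <- mean(Z) + c(-1, 0, 1) * sd(Z)
    b["X"] + b["X:Z"] * z_vals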

R, effect plots, statistical methods, visualization, simple slopes, interactions, Clay Ford

The Shortcomings of Standardized Regression Coefficients

Analysts and researchers occasionally want to compare the magnitudes of different predictive or causal effects estimated via regression. But comparison is a tricky endeavor when predictor variables are measured on different scales: If y is predicted from x and z, with x measured in kilograms and z measured in years, what does the relative size of the variables’ regression coefficients communicate about which variable is “more strongly” associated with y?
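Standardizing the variables is the usual response to this problem. A minimal sketch of what it does (illustrative data; the article goes on to examine why such comparisons can mislead):

    set.seed(1)
    n <- 200
    x <- rnorm(n, sd = 5)  # e.g., measured in kilograms
    z <- rnorm(n, sd = 2)  # e.g., measured in years
    y <- 1 + 0.4 * x + 1.5 * z + rnorm(n)
    coef(lm(y ~ x + z))                       # raw, scale-dependent coefficients
    coef(lm(scale(y) ~ scale(x) + scale(z)))  # per-SD, "standardized" coefficients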

R, simulation, statistical methods, standardized regression coefficients, Jacob Goldstein-Greenwood