StatLab Articles

Setting up Color Palettes in R

Plotting with color in R is kind of like painting a room in your house: You have to pick some colors. R has some default colors ready to go, but it's only natural to want to play around and try some different combinations. In this article, we'll look at some ways you can define new color palettes for plotting in R.

To begin, let's use the palette() function to see what colors are currently available:

R, visualization, Clay Ford

Getting Started with Hurdle Models

Hurdle models are a class of models for count data that help handle excess zeros and overdispersion. To motivate their use, let's look at some data in R. The following data come with the AER package. It is a sample of 4,406 individuals, aged 66 and over, who were covered by Medicare in 1988. One of the variables the data provide is the number of physician office visits.

statistical methods, visualization, hurdle models, negative binomial regression, poisson regression, rootograms, R, Clay Ford

Hierarchical Linear Regression

Note: This post is not about hierarchical linear modeling (HLM; multilevel modeling). Hierarchical regression is model comparison of nested regression models.

R, linear regression, statistical methods, hierarchical regression, model comparison, Bommae Kim

Getting Started with Negative Binomial Regression Modeling

When it comes to modeling counts (i.e., whole numbers greater than or equal to 0), we often start with Poisson regression. This is a generalized linear model where a response is assumed to have a Poisson distribution conditional on a weighted sum of predictors. For example, we might model the number of documented concussions to NFL quarterbacks as a function of snaps played and the total years of experience of their offensive line. However, one potential drawback of Poisson regression is that it may not accurately describe the variability of the counts.
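The variability concern comes from the fact that a Poisson distribution has its variance equal to its mean. A rough way to see whether that is plausible for a set of counts is to compare the two. The sketch below is plain Python rather than R, and the counts are hypothetical values made up for illustration; they are not data from the article.

```python
from statistics import mean, variance

# Hypothetical counts, invented for illustration only (not from the article).
counts = [0, 0, 0, 1, 1, 2, 3, 5, 9, 14]

m = mean(counts)
v = variance(counts)  # sample variance (n - 1 denominator)

# Under a Poisson model the variance should be close to the mean,
# so a ratio well above 1 hints at overdispersion.
dispersion_ratio = v / m

print(f"mean = {m:.2f}, variance = {v:.2f}, ratio = {dispersion_ratio:.2f}")
```

This ignores predictors entirely, so it is only a first look. A ratio far above 1 is the kind of pattern that motivates trying negative binomial regression.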

R, statistical methods, visualization, count regression, negative binomial regression, poisson regression, rootograms, Clay Ford

Visualizing the Effects of Logistic Regression

Logistic regression is a popular and effective way of modeling a binary response. For example, we might wonder what influences a person to volunteer, or not volunteer, for psychological research. Some do, some don’t. Are there independent variables that would help explain or distinguish between those who volunteer and those who don’t? Logistic regression gives us a mathematical model that we can use to estimate the probability of someone volunteering given certain independent variables.

R, effect plots, logistic regression, visualization, interactions, Clay Ford

Introduction to Mediation Analysis

This post intends to introduce the basics of mediation analysis and does not explain statistical details. For details, please refer to the articles at the end of this post.

Let’s say previous studies have suggested that higher grades predict higher happiness: X (grades) → Y (happiness). (This research example is made up for illustration purposes. Please don’t consider it a scientific statement.)

R, statistical methods, mediation, Bommae Kim

Reading PDF Files into R for Text Mining

Let's say we're interested in text mining the opinions of the Supreme Court of the United States. At the time of this writing, the opinions are published as PDF files at the following web page in the section titled "Opinions of the Court": https://www.supremecourt.gov/opinions/opinions.aspx. For the purposes of this introductory tutorial, we'll look at just three opinions from the 2014 term: (1) Glossip v. Gross, (2) State Legislature v.

R, text analysis, text mining, Clay Ford

Understanding Two-Way Interactions

When doing linear modeling or ANOVA it’s useful to examine whether or not the effect of one variable depends on the level of one or more other variables. If it does, then we have what is called an “interaction”. This means variables combine or interact to affect the response. The simplest type of interaction is the interaction between two two-level categorical variables. Let’s say we have gender (male and female), treatment (yes or no), and a continuous response measure. If the response to treatment depends on gender, then we have an interaction.
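For the two-by-two case, one way to picture an interaction is as a "difference of differences": compute the treatment effect within each gender and compare them. The sketch below is plain Python rather than R, and the cell means are hypothetical numbers chosen for illustration, not results from the article.

```python
# Hypothetical mean responses for each gender/treatment cell,
# invented for illustration only.
cell_means = {
    ("male", "no"): 10.0,
    ("male", "yes"): 12.0,
    ("female", "no"): 10.0,
    ("female", "yes"): 18.0,
}


def treatment_effect(gender):
    """Mean response with treatment minus mean response without it."""
    return cell_means[(gender, "yes")] - cell_means[(gender, "no")]


effect_male = treatment_effect("male")
effect_female = treatment_effect("female")

# If the two effects differ, the effect of treatment depends on gender.
interaction = effect_female - effect_male

print(f"male: {effect_male}, female: {effect_female}, difference: {interaction}")
```

A difference of zero would mean the treatment effect is the same for both groups. With real data the question is whether the difference is larger than sampling variability would explain, which is what the interaction term in a linear model assesses.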

R, statistical methods, visualization, interactions, effect plots, Clay Ford

Comparing Proportions with Relative Risk and Odds Ratios

The classic two-by-two table displays counts of what may be called “successes” and “failures” versus some two-level grouping variable, such as sex (male and female) or treatment (placebo and active drug). An example of one such table is given in Agresti (1996, p. 20). The table classifies myocardial infarction (Yes/No) with treatment group (Placebo/Aspirin). The data were “taken from a report on the relationship between aspirin use and myocardial infarction (heart attacks) by the Physicians’ Health Study Research Group at Harvard Medical School.”
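As a rough illustration of the two measures named in the title, the sketch below computes relative risk and the odds ratio from a two-by-two table. It is plain Python rather than R, and the counts are hypothetical; they are not the figures from Agresti's table.

```python
# Hypothetical two-by-two counts, invented for illustration only
# (these are NOT the counts from the table cited in the article).
placebo_yes, placebo_no = 20, 80
drug_yes, drug_no = 10, 90

# Risk = proportion of "Yes" outcomes within each group.
risk_placebo = placebo_yes / (placebo_yes + placebo_no)
risk_drug = drug_yes / (drug_yes + drug_no)

# Relative risk: ratio of the two proportions.
relative_risk = risk_placebo / risk_drug

# Odds ratio: ratio of the odds (yes / no) in each group.
odds_ratio = (placebo_yes / placebo_no) / (drug_yes / drug_no)

print(f"relative risk = {relative_risk:.2f}, odds ratio = {odds_ratio:.2f}")
```

The two measures are close when the outcome is rare in both groups and drift apart as it becomes more common, which is one reason to report which one is being used.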

R, statistical methods, odds ratio, relative risk, Clay Ford

Using and Interpreting Cronbach's Alpha

Cronbach's alpha is a measure used to assess the reliability, or internal consistency, of a set of scale or test items. In other words, the reliability of any given measurement refers to the extent to which it is a consistent measure of a concept, and Cronbach’s alpha is one way of measuring the strength of that consistency.
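The usual formula is alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. A minimal sketch of that calculation follows, in plain Python rather than R, Stata, or SPSS. The response matrix is hypothetical and made up for illustration.

```python
from statistics import variance

# Hypothetical responses: one row per respondent, one column per item.
# Invented for illustration only.
scores = [
    [3, 4, 3],
    [5, 5, 4],
    [1, 2, 2],
    [4, 4, 5],
    [2, 3, 2],
]

k = len(scores[0])  # number of items

# Sample variance of each item (column) and of each respondent's total score.
item_variances = [variance(column) for column in zip(*scores)]
total_variance = variance([sum(row) for row in scores])

alpha = k / (k - 1) * (1 - sum(item_variances) / total_variance)

print(f"Cronbach's alpha = {alpha:.3f}")
```

When the items move together, the variance of the totals is large relative to the summed item variances, and alpha approaches 1.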

R, Stata, statistical methods, scale reliability, SPSS, Chelsea Goforth
