StatLab Articles

Is R-squared Useless?

On Thursday, October 15, 2015, a disbelieving student posted on Reddit: “My stats professor just went on a rant about how R-squared values are essentially useless, is there any truth to this?” It attracted a fair amount of attention, at least compared to other posts about statistics on Reddit.

R, simulation, statistical methods, r-squared, Clay Ford

Fitting and Interpreting a Proportional Odds Model

The following table shows a cross tabulation of data taken from the 1991 General Social Survey that relates political party affiliation to political ideology (Agresti 1996).

R, statistical methods, ordinal logistic regression, proportional odds logistic regression, Clay Ford

Understanding Diagnostic Plots for Linear Regression Analysis

You ran a linear regression analysis and the stats software spit out a bunch of numbers. The results were significant (or not). You might think that you’re done with analysis. No, not yet. After running a regression analysis, you should check if the model works well for the data.

R, linear regression, statistical methods, visualization, diagnostic plots, regression diagnostics, qqplot, Bommae Kim

Getting Started with Quantile Regression

When we think of regression, we usually think of linear regression, the tried and true method for estimating a mean of some variable conditional on the levels or values of independent variables. In other words, we’re pretty sure the mean of our variable of interest differs depending on other variables. For example, the mean weight of 1st-year UVA males is some unknown value. But we could in theory take a random sample and discover there is a relationship between weight and height.

R, Stata, statistical methods, quantile regression, Clay Ford

Should I Always Transform My Variables to Make Them Normal?

When I first learned data analysis, I always checked each variable for normality and made sure it was normally distributed before running any analysis, such as a t-test, ANOVA, or linear regression. I thought that normally distributed variables were the key assumption for proceeding with an analysis. That’s why stats textbooks show you how to draw histograms and QQ plots in the early chapters to see whether variables are normally distributed, isn’t it?

R, linear regression, statistical methods, normality assumption, Bommae Kim

Simulating Endogeneity

First off, what is endogeneity, and why would we want to simulate it?

Endogeneity occurs when a statistical model has an independent variable that is correlated with the error term. The reason we would want to simulate it is to understand what exactly that definition means.

Let’s first simulate ideal data for simple linear regression using R.
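
The definition above can also be sketched outside R. The following is a minimal illustration in Python's standard library, not the article's own code: the sample size, the true slope of 2, and the 0.8 weight that ties the predictor to the error term are all assumed values chosen for the example. It fits a simple linear regression once on ideal data and once on data where the predictor is correlated with the error.

```python
import random
from statistics import fmean

def ols_slope(x, y):
    """Least-squares slope for a simple linear regression of y on x."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

random.seed(1)
n = 5000            # assumed sample size
true_slope = 2.0    # assumed true coefficient

# Ideal case: x is generated independently of the error term e.
x_ok = [random.gauss(0, 1) for _ in range(n)]
e_ok = [random.gauss(0, 1) for _ in range(n)]
y_ok = [1.0 + true_slope * x + e for x, e in zip(x_ok, e_ok)]

# Endogenous case: x is built partly from e, so the two are correlated.
e_bad = [random.gauss(0, 1) for _ in range(n)]
x_bad = [random.gauss(0, 1) + 0.8 * e for e in e_bad]
y_bad = [1.0 + true_slope * x + e for x, e in zip(x_bad, e_bad)]

slope_ok = ols_slope(x_ok, y_ok)
slope_bad = ols_slope(x_bad, y_bad)
print(f"slope with independent errors: {slope_ok:.3f}")
print(f"slope with endogenous x:       {slope_bad:.3f}")
```

In the ideal case the estimated slope should land close to the true value, while in the endogenous case it is pushed upward because part of the error is attributed to the predictor.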

R, simulation, statistical methods, two stage least squares, endogeneity, Clay Ford

Understanding QQ Plots

The QQ plot, or quantile-quantile plot, is a graphical tool to help us assess if a set of data plausibly came from some theoretical distribution such as a normal or exponential. For example, if we run a statistical analysis that assumes our residuals are normally distributed, we can use a normal QQ plot to check that assumption. It’s just a visual check, not an air-tight proof, so it is somewhat subjective.
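
To make the idea concrete, here is a rough sketch of how the points of a normal QQ plot can be computed, written in Python's standard library rather than the R used in the article. The plotting positions (i - 0.5) / n are one common convention and are an assumption here, as are the simulated samples. The correlation between theoretical and observed quantiles is used only as a crude numeric stand-in for "how straight does the plot look".

```python
import random
from statistics import NormalDist, fmean

def normal_qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal QQ plot."""
    n = len(sample)
    observed = sorted(sample)
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention.
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf(p) for p in probs]
    return list(zip(theoretical, observed))

def pearson(xs, ys):
    """Pearson correlation, used here as a rough straightness measure."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(42)
normal_sample = [random.gauss(10, 2) for _ in range(200)]
skewed_sample = [random.expovariate(1.0) for _ in range(200)]

corr_normal = pearson(*zip(*normal_qq_points(normal_sample)))
corr_skewed = pearson(*zip(*normal_qq_points(skewed_sample)))
print(f"normal sample: {corr_normal:.3f}")
print(f"skewed sample: {corr_skewed:.3f}")
```

Points from the normal sample should fall close to a straight line, while the skewed sample bends away from it. As the text notes, this remains a visual, somewhat subjective check rather than a formal test.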

R, statistical methods, visualization, qqplot, Clay Ford

Stata Tip: Name Your Graphs

An important component of data analysis is graphing. Stata provides excellent graphics facilities for quickly exploring and visualizing your data. For example, let's load the auto data set that comes with Stata (1978 Automobile Data) and make two scatterplots and then two boxplots:

Stata, visualization, Clay Ford

A Rule of Thumb for Unequal Variances

One of the assumptions of the Analysis of Variance (ANOVA) is constant variance. That is, the spread of residuals is roughly equal per treatment level. A common way to assess this assumption is plotting residuals versus fitted values. Recall that residuals are the observed values of your response of interest minus the predicted values of your response. In a one-way ANOVA, this is simply the observed values minus the group mean. For example, below we have a plot of residuals versus fitted values for a one-way ANOVA.
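
The residual calculation described above can be sketched in a few lines. This is an illustration in Python's standard library, not the article's R code, and the three treatment groups and their values are made-up numbers. For a one-way ANOVA, each fitted value is the group mean and each residual is the observed value minus that mean.

```python
from statistics import fmean, variance

# Hypothetical responses for three treatment levels (made-up numbers).
groups = {
    "A": [4.1, 5.0, 4.6, 5.3, 4.8],
    "B": [6.2, 5.9, 6.8, 6.1, 6.5],
    "C": [7.0, 9.5, 5.2, 8.8, 6.4],
}

# In a one-way ANOVA the fitted value for every observation is its group mean.
fitted = {name: fmean(values) for name, values in groups.items()}

# Residual = observed value minus the group mean.
residuals = {
    name: [value - fitted[name] for value in values]
    for name, values in groups.items()
}

# Sample variance of the residuals within each treatment level.
resid_var = {name: variance(res) for name, res in residuals.items()}

for name in groups:
    print(f"{name}: fitted = {fitted[name]:.2f}, residual variance = {resid_var[name]:.3f}")
```

Plotting each group's residuals against its fitted value gives the residuals-versus-fitted plot. Groups whose residual variances differ widely show up as vertical bands of visibly different heights.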

R, simulation, statistical methods, ANOVA, Clay Ford

Reshaping Data from Wide to Long

When performing data analysis, we often need to “reshape” our data from wide format to long format. A common example of wide data is a data structure with one record per subject and multiple columns for repeated measures. For example:
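
As a hypothetical illustration (made-up subjects and column names, written in Python's standard library rather than any of the packages the article covers), the sketch below starts from wide data with one record per subject and stacks the repeated-measure columns into long format with one record per subject and measurement. The helper `wide_to_long` and its parameter names are assumptions for this example.

```python
# Hypothetical wide data: one record per subject, one column per repeated measure.
wide = [
    {"id": 1, "t1": 5.1, "t2": 5.6, "t3": 6.0},
    {"id": 2, "t1": 4.8, "t2": 5.0, "t3": 5.4},
]

def wide_to_long(rows, id_col, value_cols, var_name="time", value_name="score"):
    """Stack the value columns so each output row holds one measurement."""
    long_rows = []
    for row in rows:
        for col in value_cols:
            long_rows.append(
                {id_col: row[id_col], var_name: col, value_name: row[col]}
            )
    return long_rows

long_rows = wide_to_long(wide, "id", ["t1", "t2", "t3"])
for record in long_rows:
    print(record)
```

Two subjects with three measures each become six long-format records, each carrying the subject identifier, the name of the original column, and the measured value.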

R, Stata, data wrangling, SAS, SPSS, data reshaping, Clay Ford
