StatLab Articles

Simulating Endogeneity

First off, what is endogeneity, and why would we want to simulate it?

Endogeneity occurs when a statistical model has an independent variable that is correlated with the error term. Simulating it helps us understand exactly what that definition means.

Let’s first simulate ideal data for simple linear regression using R.
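As a rough sketch of what such a simulation might look like (this is not the article's own code; the sample size, seed, coefficients, and the unobserved variable `u` are all assumptions chosen for illustration), we can generate one data set where the predictor is independent of the error and one where both share a common hidden component:

```r
set.seed(1)
n <- 10000

# Ideal case: x is generated independently of the error term e
x <- rnorm(n)
e <- rnorm(n)
y <- 1 + 2 * x + e                      # true slope is 2
coef_ideal <- coef(lm(y ~ x))

# Endogenous case: a hypothetical unobserved variable u feeds into
# both the predictor and the error, so the two are correlated
u <- rnorm(n)
x_endo <- rnorm(n) + u
e_endo <- rnorm(n) + u
y_endo <- 1 + 2 * x_endo + e_endo       # true slope is still 2
coef_endo <- coef(lm(y_endo ~ x_endo))

cor(x, e)               # close to 0
cor(x_endo, e_endo)     # clearly positive
coef_ideal["x"]         # close to the true slope
coef_endo["x_endo"]     # pulled away from the true slope
```

Under these assumptions the covariance between `x_endo` and `e_endo` is 1 and the variance of `x_endo` is 2, so the least squares slope drifts toward 2 + 1/2 = 2.5 rather than 2, no matter how large `n` gets.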

R, simulation, statistical methods, two stage least squares, endogeneity, Clay Ford

Understanding QQ Plots

The QQ plot, or quantile-quantile plot, is a graphical tool to help us assess if a set of data plausibly came from some theoretical distribution such as a normal or exponential distribution. For example, if we run a statistical analysis that assumes our residuals are normally distributed, we can use a normal QQ plot to check that assumption. It's just a visual check, not an airtight proof, so it is somewhat subjective.
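To make the idea concrete, here is a minimal sketch using base R's `qqnorm()`. The two samples, the seed, and the use of a correlation as a numeric summary are assumptions for illustration only; the correlation is a crude stand-in for "how straight does the plot look" and does not replace looking at the plot.

```r
set.seed(2)
z <- rnorm(200)   # sample that really is normal
w <- rexp(200)    # right-skewed sample that is not

# qqnorm() pairs theoretical normal quantiles (x) with the sorted
# sample values (y); plot.it = FALSE returns the points without drawing
qq_z <- qqnorm(z, plot.it = FALSE)
qq_w <- qqnorm(w, plot.it = FALSE)

# Points that fall near a straight line are consistent with normality
cor_z <- cor(qq_z$x, qq_z$y)
cor_w <- cor(qq_w$x, qq_w$y)
c(normal = cor_z, exponential = cor_w)

# To draw the plots themselves:
# qqnorm(z); qqline(z)
# qqnorm(w); qqline(w)
```

The normal sample should track the reference line closely, while the exponential sample bends away from it in the upper tail.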

R, statistical methods, visualization, qqplot, Clay Ford

Stata Tip: Name Your Graphs

An important component of data analysis is graphing. Stata provides an excellent graphics facility for quickly exploring and visualizing your data. For example, let's load the auto data set that comes with Stata (1978 Automobile Data) and make two scatterplots and then two boxplots:

Stata, visualization, Clay Ford

A Rule of Thumb for Unequal Variances

One of the assumptions of the Analysis of Variance (ANOVA) is constant variance. That is, the spread of residuals is roughly equal per treatment level. A common way to assess this assumption is plotting residuals versus fitted values. Recall that residuals are the observed values of your response of interest minus the predicted values of your response. In a one-way ANOVA, this is simply the observed values minus the treatment group mean. For example, below we have a plot of residuals versus fitted values for a one-way ANOVA.
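The claim that one-way ANOVA residuals are just the observations minus their treatment group means can be checked directly. The sketch below uses simulated data; the three treatment levels, group means, and standard deviations are made up for illustration, with the third group deliberately given a larger spread.

```r
set.seed(3)
trt <- factor(rep(c("a", "b", "c"), each = 10))
y <- rnorm(30,
           mean = rep(c(10, 12, 15), each = 10),
           sd   = rep(c(1, 1, 3), each = 10))   # group "c" is more variable

fit <- aov(y ~ trt)

# ave() returns each observation's treatment group mean
group_means  <- ave(y, trt)
manual_resid <- y - group_means

# Both comparisons should be TRUE: fitted values are the group means,
# and residuals are the observed values minus those means
all.equal(unname(fitted(fit)), group_means)
all.equal(unname(residuals(fit)), manual_resid)

# Spread of residuals per treatment level
resid_sd <- tapply(manual_resid, trt, sd)
resid_sd

# Residuals versus fitted values:
# plot(fitted(fit), residuals(fit))
```

Because every observation in a group shares the same fitted value, the residuals-versus-fitted plot for a one-way ANOVA shows one vertical strip of points per treatment level, which makes differences in spread easy to compare.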

R, simulation, statistical methods, ANOVA, Clay Ford

Reshaping Data from Wide to Long

When performing data analysis, we often need to "reshape" our data from wide format to long format. A common example of wide data is a data structure with one record per subject and multiple columns for repeated measures. For example:
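As a small illustration with hypothetical data (the column names `id`, `day1` to `day3`, and `score` are invented here and are not taken from the article), base R's `reshape()` is one way to go from one row per subject to one row per subject per measurement:

```r
# Wide format: one record per subject, one column per repeated measure
wide <- data.frame(
  id   = 1:3,
  day1 = c(5, 6, 7),
  day2 = c(6, 7, 9),
  day3 = c(8, 8, 10)
)

# Long format: one record per subject per day
long <- reshape(
  wide,
  direction = "long",
  varying   = c("day1", "day2", "day3"),  # columns to stack
  v.names   = "score",                    # name for the stacked values
  timevar   = "day",                      # name for the new index column
  times     = 1:3,                        # values of the index column
  idvar     = "id"
)

long <- long[order(long$id, long$day), ]  # sort by subject, then day
long
```

The three subjects with three measurements each become nine rows, with the old column position recorded in `day` and the measured value in `score`.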

R, Stata, data wrangling, SAS, SPSS, data reshaping, Clay Ford