Consider the following data from the text Design and Analysis of Experiments, 7th ed. (Montgomery, 2009, Table 3.1). It has two variables: power
and rate
. power
is a discrete setting on a tool used to etch circuits into a silicon wafer. There are four levels to choose from. rate
is the distance etched measured in Angstroms per minute. (An Angstrom is one ten-billionth of a meter.) Of interest is how (or if) the power setting affects the etch rate. Below we enter the small data set by hand into R and create a quick plot. The plot suggests higher power settings lead to higher etch rates.
# Table 3.1
power <- rep(c(160, 180, 200, 220), 5)
rate <- c(575, 565, 600, 725,
542, 593, 651, 700,
530, 590, 610, 715,
539, 579, 637, 685,
570, 610, 629, 710)
etch <- data.frame(rate, power)
stripchart(rate ~ power, data = etch, pch = 1,
vertical = TRUE, xlab = "power")
A formal statistical analysis of this data requires a linear model. To perform the analysis in R, we need to define the power
variable as a factor. This tells R that power
is a categorical variable and to treat it as such in a modeling procedure. But in addition, we may want to further specify that the levels of power
are ordered. Power level 160 is less than 180, and 180 is less than 200, and so on. This is additional information we may want to account for. We can define power
as an ordered factor in R using the ordered()
function. We do that below and save the ordered factor version as powerF
. Notice that calling head()
to view the first 6 values of powerF
shows us the ordering of the levels: 160 < 180 < 200 < 220
.
etch$powerF <- ordered(etch$power)
head(etch$powerF)
[1] 160 180 200 220 160 180
Levels: 160 < 180 < 200 < 220
It’s important to note that R is storing powerF
internally as integers, which we can see by calling unclass()
or as.integer()
on the variable. (We'll make use of this fact in the next plot we create.)
unclass(etch$powerF)
[1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
attr(etch$powerF,"levels")
[1] "160" "180" "200" "220"
Now let’s formally analyze the data using the lm()
function in R. This is equivalent to performing a one-way ANOVA. Below we model rate
as a function of the ordered factor powerF
and pipe the result into the summary()
function using base R’s native forward pipe operator, |>
, introduced in version 4.1.0.
lm(rate ~ powerF, data = etch) |> summary()
Call:
lm(formula = rate ~ powerF, data = etch)
Residuals:
Min 1Q Median 3Q Max
-25.4 -13.0 2.8 13.2 25.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 617.750 4.085 151.234 < 2e-16 ***
powerF.L 113.011 8.169 13.833 2.56e-10 ***
powerF.Q 22.700 8.169 2.779 0.0134 *
powerF.C 9.347 8.169 1.144 0.2694
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.27 on 16 degrees of freedom
Multiple R-squared: 0.9261, Adjusted R-squared: 0.9122
F-statistic: 66.8 on 3 and 16 DF, p-value: 2.883e-09
At the bottom of the output is the familiar ANOVA test result. The F-statistic is large and the p-value is tiny, which provides evidence that the mean rates between the four power levels are different. (We have statistically sanctified what we saw in the plot.) In addition, the adjusted R-squared suggests power
explains about 91% of the variability we’re seeing in the etch rates. But what about the rest? What are these mysterious “powerF.L”, “powerF.Q”, and “powerF.C” labels? What do their estimates mean? What are the hypothesis tests associated with them?
The quick answers to these questions are that the L, Q, and C stand for linear, quadratic, and cubic; that the coefficients don't have a ready interpretation; and that each hypothesis is a test for trend.
The “powerF.L” entry is a test for linear trend. The null is no linear trend (i.e., a flat line). The large t-value provides evidence against this. Again, this is not a surprise. Looking at the plot above, we can visualize a straight line over the points.
The “powerF.Q” entry is a test for quadratic trend. The null is no quadratic trend (i.e., a straight line). The t-value of 2.8 provides some evidence against this. This is less obvious in the plot, though we can perhaps imagine a slight curve superimposed over the data.
The “powerF.C” entry is a test for cubic trend. The null is no cubic trend (i.e., a straight or quadratic line). The t-value of 1.1 is small, and the p-value is quite large. There doesn’t appear to be any cubic trend in the data.
We can visualize these trends using ggplot()
. Below we call geom_smooth()
three times to plot linear, quadratic and cubic lines using lm()
. Notice we have to unclass()
the powerF
variable to make this work. Defining the colors as “1”, “2”, and “3” is an easy way to force ggplot()
to create a legend and place the labels in that order. We notice there isn’t much difference between the fitted quadratic and cubic lines, visually demonstrating why the cubic trend test was not “significant.”
library(ggplot2)
ggplot(etch) +
aes(x = powerF, y = rate) +
geom_point(shape = 1) +
geom_smooth(aes(x = unclass(powerF), color = "1"),
formula = y ~ x,
method = lm, se = FALSE) +
geom_smooth(aes(x = unclass(powerF), color = "2"),
formula = y ~ poly(x, 2),
method = lm, se = FALSE) +
geom_smooth(aes(x = unclass(powerF), color = "3"),
formula = y ~ poly(x, 3),
method = lm, se = FALSE) +
scale_color_discrete("Trend", labels = c("linear", "quadratic", "cubic")) +
theme_classic()
If you’re thinking these tests for trend look a lot like a polynomial model, you’re right. That’s basically what it is. In fact, we get identical test results (i.e., the same test statistics and p-values) if we simply fit a 3-degree polynomial model to the original numeric power
variable.
lm(rate ~ poly(power, 3), data = etch) |> summary()
Call:
lm(formula = rate ~ poly(power, 3), data = etch)
Residuals:
Min 1Q Median 3Q Max
-25.4 -13.0 2.8 13.2 25.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 617.750 4.085 151.234 < 2e-16 ***
poly(power, 3)1 252.700 18.267 13.833 2.56e-10 ***
poly(power, 3)2 50.759 18.267 2.779 0.0134 *
poly(power, 3)3 20.900 18.267 1.144 0.2694
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.27 on 16 degrees of freedom
Multiple R-squared: 0.9261, Adjusted R-squared: 0.9122
F-statistic: 66.8 on 3 and 16 DF, p-value: 2.883e-09
We also get the same test results if we unclass()
the ordered factor version of power
and fit a 3-degree polynomial model.
lm(rate ~ poly(unclass(powerF), 3), data = etch) |> summary()
Call:
lm(formula = rate ~ poly(unclass(powerF), 3), data = etch)
Residuals:
Min 1Q Median 3Q Max
-25.4 -13.0 2.8 13.2 25.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 617.750 4.085 151.234 < 2e-16 ***
poly(unclass(powerF), 3)1 252.700 18.267 13.833 2.56e-10 ***
poly(unclass(powerF), 3)2 50.759 18.267 2.779 0.0134 *
poly(unclass(powerF), 3)3 20.900 18.267 1.144 0.2694
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.27 on 16 degrees of freedom
Multiple R-squared: 0.9261, Adjusted R-squared: 0.9122
F-statistic: 66.8 on 3 and 16 DF, p-value: 2.883e-09
Did you notice the coefficients and standard errors of these two models differ from the first model with the ordered factor? That’s because the ordered factor model uses a contrast. A contrast is a matrix that transforms a series of 0/1 dummy variables into columns that can be estimated in a modeling routine. The default contrast for ordered factors in R is the polynomial contrast. We can see the contrast R uses by calling the contr.poly()
function. Simply tell it how many levels your categorical variable has. In our case we have 4.
contr.poly(4)
.L .Q .C
[1,] -0.6708204 0.5 -0.2236068
[2,] -0.2236068 -0.5 0.6708204
[3,] 0.2236068 -0.5 -0.6708204
[4,] 0.6708204 0.5 0.2236068
That’s the contrast used to convert dummy variables of an ordered factor into a model matrix. Let’s demonstrate. Below we generate a 4-column matrix of dummy variables for the power
variable. There is one row per observation and one column per level of power
. A 1 in a row identifies membership in one of the four power
categories.
Xpower <- sapply(c(160, 180, 200, 220),
function(x)ifelse(etch$power==x, 1, 0))
colnames(Xpower) <- c("160", "180", "200", "220")
head(Xpower)
160 180 200 220
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 0 0 1 0
[4,] 0 0 0 1
[5,] 1 0 0 0
[6,] 0 1 0 0
Now we multiply the dummy variable matrix by the polynomial contrast to obtain a model matrix that is estimable.
Xpower %*% contr.poly(4)
.L .Q .C
[1,] -0.6708204 0.5 -0.2236068
[2,] -0.2236068 -0.5 0.6708204
[3,] 0.2236068 -0.5 -0.6708204
[4,] 0.6708204 0.5 0.2236068
[5,] -0.6708204 0.5 -0.2236068
[6,] -0.2236068 -0.5 0.6708204
[7,] 0.2236068 -0.5 -0.6708204
[8,] 0.6708204 0.5 0.2236068
[9,] -0.6708204 0.5 -0.2236068
[10,] -0.2236068 -0.5 0.6708204
[11,] 0.2236068 -0.5 -0.6708204
[12,] 0.6708204 0.5 0.2236068
[13,] -0.6708204 0.5 -0.2236068
[14,] -0.2236068 -0.5 0.6708204
[15,] 0.2236068 -0.5 -0.6708204
[16,] 0.6708204 0.5 0.2236068
[17,] -0.6708204 0.5 -0.2236068
[18,] -0.2236068 -0.5 0.6708204
[19,] 0.2236068 -0.5 -0.6708204
[20,] 0.6708204 0.5 0.2236068
Notice this is the same model.matrix()
that is used when we model the ordered factor with lm()
. (The column of ones is for the intercept.)
lm(rate ~ powerF, data = etch) |> model.matrix()
(Intercept) powerF.L powerF.Q powerF.C
1 1 -0.6708204 0.5 -0.2236068
2 1 -0.2236068 -0.5 0.6708204
3 1 0.2236068 -0.5 -0.6708204
4 1 0.6708204 0.5 0.2236068
5 1 -0.6708204 0.5 -0.2236068
6 1 -0.2236068 -0.5 0.6708204
7 1 0.2236068 -0.5 -0.6708204
8 1 0.6708204 0.5 0.2236068
9 1 -0.6708204 0.5 -0.2236068
10 1 -0.2236068 -0.5 0.6708204
11 1 0.2236068 -0.5 -0.6708204
12 1 0.6708204 0.5 0.2236068
13 1 -0.6708204 0.5 -0.2236068
14 1 -0.2236068 -0.5 0.6708204
15 1 0.2236068 -0.5 -0.6708204
16 1 0.6708204 0.5 0.2236068
17 1 -0.6708204 0.5 -0.2236068
18 1 -0.2236068 -0.5 0.6708204
19 1 0.2236068 -0.5 -0.6708204
20 1 0.6708204 0.5 0.2236068
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$powerF
[1] "contr.poly"
The latter two models that explicitly modeled a 3-degree polynomial did not use a contrast. Instead, the actual power
values were squared and cubed.
Is there an advantage to using one over the other? Maybe we should have just used polynomial regression instead of creating an ordered factor. Fox and Weisberg (2019) tell us contr.poly()
is appropriate when the levels of your ordered factor are equally spaced and balanced. That is the case for the power
variable. Each level of power
is separated by 20 units, and we have five observations for each level. If we have an ordered factor that does not have equal spacing and/or is not balanced, we’re better off performing polynomial regression via the poly()
function.
It’s worth noting that even though we have power
saved as an ordered factor and modeled with a polynomial contrast, we can still perform follow-up analyses usually associated with standard unordered categorical variables. For example, using the emmeans package we can make pairwise comparisons.
library(emmeans)
lm(rate ~ powerF, data = etch) |>
emmeans(specs = "powerF") |>
pairs()
contrast estimate SE df t.ratio p.value
160 - 180 -36.2 11.6 16 -3.133 0.0294
160 - 200 -74.2 11.6 16 -6.422 <.0001
160 - 220 -155.8 11.6 16 -13.485 <.0001
180 - 200 -38.0 11.6 16 -3.289 0.0216
180 - 220 -119.6 11.6 16 -10.352 <.0001
200 - 220 -81.6 11.6 16 -7.063 <.0001
P value adjustment: tukey method for comparing a family of 4 estimates
Hopefully you now have a better understanding of what’s happening in R when you set a factor as ordered and use it in a linear model.
References
- Fox, J., & Weisberg, S. (2019). An R companion to applied regression (3rd ed.). Sage. https://socialsciences.mcmaster.ca/jfox/Books/Companion/
- Lenth, R. (2023). emmeans: Estimated marginal means, aka least-squares means (Version 1.8.5). https://cran.r-project.org/package=emmeans
- Montgomery, D. (2009). Design and analysis of experiments (7th ed.). Wiley.
- R Core Team. (2023). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org
Clay Ford
Statistical Research Consultant
University of Virginia Library
July 01, 2021
For questions or clarifications regarding this article, contact statlab@virginia.edu.
View the entire collection of UVA Library StatLab articles, or learn how to cite.