Objective

The purpose of this example is to demonstrate the similarities between two-way ANOVA and multiple linear regression models. It’s an extension of this post I wrote the other day so you’ll see quite a bit of duplication in the text, as I just edited that post to create this one.

These examples use synthetic data, so we know what the properties of the distributions are, and know what the answers should be even before we start the problem.

Create Synthetic Data from Anthropometry Tables

First, let’s simulate some data. I used anthropometry (body measurements) data from the U.S. Centers for Disease Control (CDC), focusing on HEIGHTS AND WEIGHTS of male children in different stages of development. Each row in the height (weight) table describes heights in cm (or weights in lb) for a sample (n) of people, reported as the mean for that sample and the standard error of that mean (SEM). I used n and SEM to calculate the sample standard deviations for 2, 7, and 19 year olds, and used this information to generate a synthetic dataset of 20 observations for each group.

library(tidyverse)

# data from CDC Anthropometry: https://www.cdc.gov/nchs/data/series/sr_11/sr11_252.pdf
set.seed(1234)

# generate characteristics of data
   hts <- c(91.8,125.4,177.8)
   wts <- c(31.0,58.9,185.3)
   sem <- c(0.38,0.43,0.81)
   sem.wts <- c(0.32,0.95,3.07)
   n <- c(318,193,211)
   n.wts <- c(325,216,179)
   sigmas <- sem*sqrt(n)
   sigmas.wts <- sem*sqrt(n.wts)

grp <- c(rep(2,20), rep(7,20), rep(19,20))
   
ht.2 <- rnorm(20,mean=hts[1],sd=sigmas[1])
ht.7 <- rnorm(20,mean=hts[2],sd=sigmas[2])
ht.19 <- rnorm(20,mean=hts[3],sd=sigmas[3])

wt.2 <- rnorm(20,mean=wts[1],sd=sigmas.wts[1])
wt.7 <- rnorm(20,mean=wts[2],sd=sigmas.wts[2])
wt.19 <- rnorm(20,mean=wts[3],sd=sigmas.wts[3])

df0 <- rbind(cbind(ht=ht.2,wt=wt.2), cbind(ht.7,wt.7), cbind(ht.19,wt.19))
df  <- as_tibble( cbind(grp=grp, df0) )
df$grp <- as.factor(grp)
df %>% group_by(grp) %>% summarize_all(.funs=mean)
## # A tibble: 3 x 3
##   grp      ht    wt
##   <fct> <dbl> <dbl>
## 1 2      90.1  33.0
## 2 7     122.   60.2
## 3 19    173.  184.
df %>% ggplot(aes(x=ht,y=wt)) + geom_point() + facet_wrap(~grp) + ggtitle("Scatterplots of ht/wt in Age Groups")

I picked 2, 7, and 19 year olds because there’s a good chance we can differentiate the groups by both height and weight. A two-way ANOVA with height and group used to predict weight should tell us that both height and group are significant.

Linear Models

Scatterplots are shown below with linear fits to each of the groups (2 year olds, 7 year olds, and 19 years olds). Each group has 20 observations, and the envelopes (which are based on standard error) get larger the farther away we move from the heights we actually measured. You can hover on the lines to see the equations.

library(ggiraphExtra)

fit <- lm(wt ~ ht+grp, data=df)
summary(fit)
## 
## Call:
## lm(formula = wt ~ ht + grp, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.9210  -4.0445   0.5729   4.2698  14.9625 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.5637     9.5712   2.044   0.0457 *  
## ht            0.1488     0.1050   1.417   0.1620    
## grp7         22.4657     3.9369   5.706 4.54e-07 ***
## grp19       138.8394     8.9028  15.595  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.573 on 56 degrees of freedom
## Multiple R-squared:  0.9908, Adjusted R-squared:  0.9903 
## F-statistic:  2003 on 3 and 56 DF,  p-value: < 2.2e-16
ggPredict(fit, se=TRUE, interactive=TRUE)

One-Way ANOVA

Since we know the three distributions have very little overlap, we know we’re going to reject the null hypothesis that there is no difference in weight between the groups. (It’s clearly not.) We also find that height is significantly different between the groups. ANOVA gives us the tiny p-values [Pr(>F)] we expect:

z <- aov(wt ~ ht+grp, data=df)
summary(z)
##             Df Sum Sq Mean Sq F value Pr(>F)    
## ht           1 235634  235634  5453.1 <2e-16 ***
## grp          2  23958   11979   277.2 <2e-16 ***
## Residuals   56   2420      43                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-statistic associated with this analysis of variance can be calculated by pulling values out of the table above:

\[F = \frac{Model\:SS/Model\:DF}{Residuals\:SS/Residuals\:DF} \] \[F = \frac{(235634+23958)/(1+2)}{2420/56} = 2002.4\]

Linear Model with an Intercept

It works out the same, pretty much, with a linear model:

fit1 <- lm(wt~1+ht+grp, data=df)  
summary(fit1)
## 
## Call:
## lm(formula = wt ~ 1 + ht + grp, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.9210  -4.0445   0.5729   4.2698  14.9625 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.5637     9.5712   2.044   0.0457 *  
## ht            0.1488     0.1050   1.417   0.1620    
## grp7         22.4657     3.9369   5.706 4.54e-07 ***
## grp19       138.8394     8.9028  15.595  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.573 on 56 degrees of freedom
## Multiple R-squared:  0.9908, Adjusted R-squared:  0.9903 
## F-statistic:  2003 on 3 and 56 DF,  p-value: < 2.2e-16

Notice how grp7 and grp19 are in that first column? The missing group, representing the 2 year olds, is the reference group. The F-statistic that we calculated above from the ANOVA results (2002.4) is also shown here (2003) which indicates that the entire model is significant.

The equation for this linear fit, which shows group membership (but not height) are significant predictors, is:

\[y = 19.56 + (0.15 \:*\: {ht}) + (22.47 \:*\: {grp7}) + (138.84 \:*\: {grp19})\]

  • If we want to predict the height of the 19 year olds, grp7 and grp2 are both going to be zero as a 19 year old is not in either of these groups, and grp19 will be 1. The linear model reduces to y=19.56+(0.15ht)+138.84 or y=0.15ht+158.4. This is the equation you will see if you hover over the blue line in the colored scatterplot above.
  • If we want to predict the height of the 2 year olds, grp2 equals 1 but grp7 and grp19 are zero. The model reduces to y=0.15ht+19.56, which is what you will see if you hover over the pink line in the colored scatterplot.
  • If we want to predict the height of the 7 year olds, grp7 equals 1 but grp2 and grp19 are zero. The model reduces to y=19.56+0.15ht+22.47 or y=0.15ht+42.03, which is what you will see if you hover over the green line in the colored scatterplot.