The purpose of this example is to demonstrate the similarities between two-way ANOVA and multiple linear regression models. It’s an extension of this post I wrote the other day so you’ll see quite a bit of duplication in the text, as I just edited that post to create this one.

These examples use synthetic data, so we know what the properties of the distributions are, and know what the answers *should be* even before we start the problem.

First, let’s simulate some data. I used anthropometry (body measurements) data from the U.S. Centers for Disease Control (CDC), focusing on HEIGHTS AND WEIGHTS of male children in different stages of development. Each row in the height (weight) table describes heights in cm (or weights in lb) for a sample (n) of people, reported as the mean for that sample and the standard error of that mean (SEM). I used n and SEM to calculate the sample standard deviations for 2, 7, and 19 year olds, and used this information to generate a synthetic dataset of 20 observations for each group.

```
library(tidyverse)
# data from CDC Anthropometry: https://www.cdc.gov/nchs/data/series/sr_11/sr11_252.pdf
set.seed(1234)
# generate characteristics of data
hts <- c(91.8,125.4,177.8)
wts <- c(31.0,58.9,185.3)
sem <- c(0.38,0.43,0.81)
sem.wts <- c(0.32,0.95,3.07)
n <- c(318,193,211)
n.wts <- c(325,216,179)
sigmas <- sem*sqrt(n)
sigmas.wts <- sem*sqrt(n.wts)
grp <- c(rep(2,20), rep(7,20), rep(19,20))
ht.2 <- rnorm(20,mean=hts[1],sd=sigmas[1])
ht.7 <- rnorm(20,mean=hts[2],sd=sigmas[2])
ht.19 <- rnorm(20,mean=hts[3],sd=sigmas[3])
wt.2 <- rnorm(20,mean=wts[1],sd=sigmas.wts[1])
wt.7 <- rnorm(20,mean=wts[2],sd=sigmas.wts[2])
wt.19 <- rnorm(20,mean=wts[3],sd=sigmas.wts[3])
df0 <- rbind(cbind(ht=ht.2,wt=wt.2), cbind(ht.7,wt.7), cbind(ht.19,wt.19))
df <- as_tibble( cbind(grp=grp, df0) )
df$grp <- as.factor(grp)
df %>% group_by(grp) %>% summarize_all(.funs=mean)
```

```
## # A tibble: 3 x 3
## grp ht wt
## <fct> <dbl> <dbl>
## 1 2 90.1 33.0
## 2 7 122. 60.2
## 3 19 173. 184.
```

`df %>% ggplot(aes(x=ht,y=wt)) + geom_point() + facet_wrap(~grp) + ggtitle("Scatterplots of ht/wt in Age Groups")`

I picked 2, 7, and 19 year olds because there’s a good chance we can differentiate the groups by both height and weight. A two-way ANOVA with height and group used to predict weight should tell us that both height and group are significant.

Scatterplots are shown below with linear fits to each of the groups (2 year olds, 7 year olds, and 19 years olds). Each group has 20 observations, and the envelopes (which are based on standard error) get larger the farther away we move from the heights we actually measured. You can hover on the lines to see the equations.

```
library(ggiraphExtra)
fit <- lm(wt ~ ht+grp, data=df)
summary(fit)
```

```
##
## Call:
## lm(formula = wt ~ ht + grp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.9210 -4.0445 0.5729 4.2698 14.9625
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.5637 9.5712 2.044 0.0457 *
## ht 0.1488 0.1050 1.417 0.1620
## grp7 22.4657 3.9369 5.706 4.54e-07 ***
## grp19 138.8394 8.9028 15.595 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.573 on 56 degrees of freedom
## Multiple R-squared: 0.9908, Adjusted R-squared: 0.9903
## F-statistic: 2003 on 3 and 56 DF, p-value: < 2.2e-16
```

`ggPredict(fit, se=TRUE, interactive=TRUE)`

Since we know the three distributions have very little overlap, we know we’re going to reject the null hypothesis that there is no difference in weight between the groups. (It’s clearly not.) We also find that height is significantly different between the groups. ANOVA gives us the tiny p-values [Pr(>F)] we expect:

```
z <- aov(wt ~ ht+grp, data=df)
summary(z)
```

```
## Df Sum Sq Mean Sq F value Pr(>F)
## ht 1 235634 235634 5453.1 <2e-16 ***
## grp 2 23958 11979 277.2 <2e-16 ***
## Residuals 56 2420 43
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The F-statistic associated with this analysis of variance can be calculated by pulling values out of the table above:

\[F = \frac{Model\:SS/Model\:DF}{Residuals\:SS/Residuals\:DF} \] \[F = \frac{(235634+23958)/(1+2)}{2420/56} = 2002.4\]

It works out the same, pretty much, with a linear model:

```
fit1 <- lm(wt~1+ht+grp, data=df)
summary(fit1)
```

```
##
## Call:
## lm(formula = wt ~ 1 + ht + grp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.9210 -4.0445 0.5729 4.2698 14.9625
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.5637 9.5712 2.044 0.0457 *
## ht 0.1488 0.1050 1.417 0.1620
## grp7 22.4657 3.9369 5.706 4.54e-07 ***
## grp19 138.8394 8.9028 15.595 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.573 on 56 degrees of freedom
## Multiple R-squared: 0.9908, Adjusted R-squared: 0.9903
## F-statistic: 2003 on 3 and 56 DF, p-value: < 2.2e-16
```

Notice how `grp7`

and `grp19`

are in that first column? The missing group, representing the 2 year olds, is the *reference group*. The F-statistic that we calculated above from the ANOVA results (2002.4) is also shown here (2003) which indicates that the entire model is significant.

The equation for this linear fit, which shows group membership (but not height) are significant predictors, is:

\[y = 19.56 + (0.15 \:*\: {ht}) + (22.47 \:*\: {grp7}) + (138.84 \:*\: {grp19})\]

- If we want to predict the height of the 19 year olds,
`grp7`

and`grp2`

are both going to be zero as a 19 year old is not in either of these groups, and`grp19`

will be 1. The linear model reduces to y=19.56+(0.15ht)+138.84 or y=0.15ht+158.4. This is the equation you will see if you hover over the blue line in the colored scatterplot above. - If we want to predict the height of the 2 year olds,
`grp2`

equals 1 but`grp7`

and`grp19`

are zero. The model reduces to y=0.15ht+19.56, which is what you will see if you hover over the pink line in the colored scatterplot. - If we want to predict the height of the 7 year olds,
`grp7`

equals 1 but`grp2`

and`grp19`

are zero. The model reduces to y=19.56+0.15ht+22.47 or y=0.15ht+42.03, which is what you will see if you hover over the green line in the colored scatterplot.