Covariate adjustment in regression: direct vs indirect

This notebook shows the difference between a direct regression of $y$ on $x$ that includes the covariate $z$ in the model, and an indirect analysis that first removes $z$ from $y$ and then regresses the residual on $x$.
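Concretely, the two analyses being compared are (in the notation used later in this notebook)

$$\text{direct:}\;\; y = xb + zw + e \qquad \text{vs.} \qquad \text{indirect:}\;\; y_{residual} = xb + e$$

where $y_{residual}$ is the residual from first regressing $y$ on $z$.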

We assess these scenarios:

  1. There is an association between $x$ and $y$, and between $z$ and $y$, but no association between $x$ and $z$
  2. $x$ and $z$ are associated

Scenario 1: $x$ and $z$ are not associated

Data simulation

Here, $x$ and $z$ have the same effect size on $y$ but are independent of each other.

set.seed(1)
n = 500
x = rnorm(n)
z = rnorm(n)

bx = 0.2
bz = 0.2
y = x * bx + z * bz + rnorm(n)

Multiple regression analysis

To fit the model $y = xb + zw + e$:

out = lm(y~x+z)
summary(out)


Call:
lm(formula = y ~ x + z)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3595 -0.7000 -0.0682  0.7195  3.6850 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.005709   0.045138  -0.126 0.899413    
x            0.168389   0.044637   3.772 0.000181 ***
z            0.125210   0.042722   2.931 0.003536 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.008 on 497 degrees of freedom
Multiple R-squared:  0.04229,	Adjusted R-squared:  0.03844 
F-statistic: 10.97 on 2 and 497 DF,  p-value: 2.171e-05

The p-value for $x$ is 0.000181.

Regression with residual

Now we “remove” $z$ from $y$ first:

res = residuals(lm(y~z))

Then we perform the regression analysis $y_{residual} = xb + e$:

out = lm(res~x)
summary(out)


Call:
lm(formula = res ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3712 -0.7007 -0.0791  0.7217  3.6843 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.003807   0.045053  -0.084 0.932701    
x            0.168103   0.044556   3.773 0.000181 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.007 on 498 degrees of freedom
Multiple R-squared:  0.02779,	Adjusted R-squared:  0.02584 
F-statistic: 14.23 on 1 and 498 DF,  p-value: 0.0001808

The p-value for $x$ is 0.000181, which is the same as before. Because $x$ and $z$ are essentially uncorrelated here, removing $z$ from $y$ first leaves the coefficient and p-value for $x$ nearly unchanged.

Now, instead of using the residuals function in R, I use the code below to remove $z$ from $y$ by solving the normal equations directly.
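In matrix form, with $Z = [\mathbf{1}\;\; z]$ denoting the design matrix (intercept column plus $z$), the code solves the normal equations and subtracts the fitted values:

$$\hat\gamma = (Z'Z)^{-1} Z' y, \qquad y_{residual} = y - Z\hat\gamma$$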

z.with.intercept = cbind(1, z)                        # design matrix Z = [1, z]
A = crossprod(z.with.intercept)                       # Z'Z

szy = as.vector(solve(A, c(y %*% z.with.intercept)))  # solve Z'Z g = Z'y for g
res = y - z.with.intercept %*% szy                    # residual y - Z g

out = lm(res~x)
summary(out)


Call:
lm(formula = res ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3712 -0.7007 -0.0791  0.7217  3.6843 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.003807   0.045053  -0.084 0.932701    
x            0.168103   0.044556   3.773 0.000181 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.007 on 498 degrees of freedom
Multiple R-squared:  0.02779,	Adjusted R-squared:  0.02584 
F-statistic: 14.23 on 1 and 498 DF,  p-value: 0.0001808

The result is the same as when using the residuals function in R, as expected: both approaches compute the same projection of $y$ onto the orthogonal complement of the space spanned by the intercept and $z$.

Scenario 2: $x$ and $z$ are associated

This introduces collinearity between the predictors, which makes the estimate of the effect of $x$ on $y$ conditional on $z$ less precise.
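As a reminder (a standard OLS identity, stated here without derivation), in a model with two predictors the sampling variance of the coefficient of $x$ is inflated by the factor $1/(1 - r_{xz}^2)$, where $r_{xz}$ is the sample correlation between $x$ and $z$ and $\sigma^2$ is the error variance:

$$\operatorname{Var}(\hat b) = \frac{\sigma^2}{\sum_i (x_i - \bar x)^2} \cdot \frac{1}{1 - r_{xz}^2}$$

To illustrate, we simulate data in which the $x$–$z$ association is strong.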

Data simulation

Notice that here the association between $x$ and $z$ is strong (even stronger than that between $x$ and $y$):

set.seed(1)
n = 500
x = rnorm(n)
bxz = 0.5
z = x * bxz + rnorm(n)
bx = 0.2
bz = 0.2
y = x * bx + z * bz + rnorm(n)
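Given this setup, the population correlation between $x$ and $z$ is $0.5/\sqrt{1.25} \approx 0.45$; the realized sample correlation can be checked directly (a quick inspection, not part of the original analysis):

cor(x, z)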

Multiple regression analysis

To fit the model $y = xb + zw + e$:

out = lm(y~x+z)
summary(out)


Call:
lm(formula = y ~ x + z)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3595 -0.7000 -0.0682  0.7195  3.6850 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.005709   0.045138  -0.126  0.89941    
x            0.205785   0.048684   4.227 2.82e-05 ***
z            0.125210   0.042722   2.931  0.00354 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.008 on 497 degrees of freedom
Multiple R-squared:  0.08025,	Adjusted R-squared:  0.07655 
F-statistic: 21.68 on 2 and 497 DF,  p-value: 9.382e-10

Remove $z$ from $y$ only and regress the residual on $x$

res = residuals(lm(y~z))
out = lm(res~x)
summary(out)


Call:
lm(formula = res ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2326 -0.6927 -0.0577  0.7391  3.6922 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.003911   0.045182  -0.087 0.931062    
x            0.172701   0.044683   3.865 0.000126 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.01 on 498 degrees of freedom
Multiple R-squared:  0.02912,	Adjusted R-squared:  0.02717 
F-statistic: 14.94 on 1 and 498 DF,  p-value: 0.0001258

The p-value for $x$ differs between these two analyses: 2.82e-05 vs 0.000126. Removing $z$ from $y$ but not from $x$ attenuates the coefficient, because the component of $x$ that is correlated with $z$ remains in the regressor while the corresponding component of $y$ has been removed.
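In fact the attenuation has an exact algebraic form: the coefficient from regressing the residual on raw $x$ equals the full-regression coefficient multiplied by $1 - R^2$ from regressing $x$ on $z$. A quick check, using the scenario 2 variables already in the workspace:

r2.xz = summary(lm(x ~ z))$r.squared      # R^2 from regressing x on z
coef(lm(y ~ x + z))["x"] * (1 - r2.xz)    # reproduces the 0.1727 estimate above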

Remove $z$ from both $x$ and $y$

Using the same approach as before, we can use the code below to remove $z$ from both $x$ and $y$:

z.with.intercept = cbind(1, z)                          # design matrix Z = [1, z]
A = crossprod(z.with.intercept)                         # Z'Z
szy = as.vector(solve(A, c(y %*% z.with.intercept)))    # coefficients from regressing y on Z
res.y = y - z.with.intercept %*% szy                    # y with intercept and z removed
szx = as.vector(solve(A, c(x %*% z.with.intercept)))    # coefficients from regressing x on Z
res.x = x - z.with.intercept %*% szx                    # x with intercept and z removed
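As a sanity check (not part of the original analysis), these hand-rolled residuals should agree with what the residuals function returns:

all.equal(as.vector(res.y), unname(residuals(lm(y ~ z))))
all.equal(as.vector(res.x), unname(residuals(lm(x ~ z))))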

out = lm(res.y~res.x)
summary(out)


Call:
lm(formula = res.y ~ res.x)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3595 -0.7000 -0.0682  0.7195  3.6850 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.139e-17  4.504e-02   0.000        1    
res.x       2.058e-01  4.864e-02   4.231 2.77e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.007 on 498 degrees of freedom
Multiple R-squared:  0.0347,	Adjusted R-squared:  0.03276 
F-statistic:  17.9 on 1 and 498 DF,  p-value: 2.768e-05

Here the results match the earlier multiple regression analysis of $y = xb + zw + e$: the coefficient for $x$ is identical (0.2058) and the p-values are essentially the same (2.82e-05 vs 2.77e-05); the tiny discrepancy arises only because the residual regression uses 498 degrees of freedom rather than 497, since it does not account for the parameter spent on $z$. The equality of the coefficient estimates is the Frisch–Waugh–Lovell result.

Which version is more justified when $x$ and $z$ are associated?

We have established that multiple regression (equivalently, removing $z$ from both $x$ and $y$ and regressing the $y$ residual on the $x$ residual) differs from removing $z$ from $y$ only and regressing that residual on $x$. The question is: which version is more justified?

I think it depends on our belief about causal relationship. If both $y$ and $z$ are phenotypes and $x$ the genotype, then there is no way $z$ can cause $x$, and $z$ is potentially a mediator. In that case I think removing $z$ from $y$ and not from $x$ is justifiable. If $z$ is PC scores from PCA analysis on genotypes then indeed $z$ here is a confounder – it “causes” both genotype $x$ (specific to an ancestry) and phenotype $y$ (also specific to an ancestry). In that case, $z$ should be removed from both $x$ and $y$.