Covariate adjustment in regression: direct vs indirect
This notebook shows the difference between a direct regression analysis of $y$ on $x$ that includes the covariates $z$, and a regression analysis on the residual of $y$ after removing the effect of the covariates $z$.
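Concretely, the two approaches compared are: (1) fit the multiple regression $y = xb + zw + e$ and look at the estimate of $b$; (2) regress $y$ on $z$ alone (with an intercept), take the residual $y_{residual}$, and fit $y_{residual} = xb + e$.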
We assess these scenarios:
- There is an association between $x$ and $y$ and between $z$ and $y$, but no association between $x$ and $z$
- $x$ and $z$ are associated
Scenario 1: $x$ and $z$ are not associated
Data simulation
Here, $x$ and $z$ have the same effect size on $y$ but are independent of each other.
set.seed(1)
n = 500
x = rnorm(n)
z = rnorm(n)
bx = 0.2
bz = 0.2
y = x * bx + z * bz + rnorm(n)
Multiple regression analysis
To fit the model $y = xb + zw + e$:
out = lm(y~x+z)
summary(out)
The p-value for $x$ is 0.000181.
Regression with residual
Now we “remove” $z$ from $y$ first:
res = residuals(lm(y~z))
Then perform the regression analysis $y_{residual} = xb + e$:
out = lm(res~x)
summary(out)
The p-value for $x$ is 0.000181, which is the same as what we obtained before.
Now, instead of using the `residuals` function in R, I use the code below to remove $z$ from $y$:
z.with.intercept = cbind(1, z)                         # design matrix Z = [intercept, z]
A = crossprod(z.with.intercept)                        # Z'Z
szy = as.vector(solve(A, c(y %*% z.with.intercept)))   # solve Z'Z w = Z'y for w
res = y - z.with.intercept %*% szy                     # residual: y minus its projection on Z
out = lm(res~x)
summary(out)
The result is the same as using the `residuals` function in R.
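In matrix form, the code above solves the normal equations $Z^\top Z \hat{w} = Z^\top y$, where $Z$ is the two-column matrix holding the intercept and $z$, and then forms the residual $y - Z\hat{w}$; this is exactly what `residuals(lm(y~z))` returns.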
Scenario 2: $x$ and $z$ are associated
This is a situation of multicollinearity, which causes an imprecise estimate of the effect of $x$ on $y$ conditional on $z$. To illustrate:
Data simulation
Notice that here the association between $x$ and $z$ is strong (even stronger than that between $x$ and $y$).
set.seed(1)
n = 500
x = rnorm(n)
bxz = 0.5
z = x * bxz + rnorm(n)
bx = 0.2
bz = 0.2
y = x * bx + z * bz + rnorm(n)
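As a quick check (an addition, not in the original notebook), we can look at the sample correlation between $x$ and $z$; with bxz = 0.5 and unit noise variance it is about $0.5/\sqrt{1.25} \approx 0.45$ in expectation.
cor(x, z)   # sample correlation between x and z induced by the simulation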
Multiple regression analysis
To fit the model $y = xb + zw + e$:
out = lm(y~x+z)
summary(out)
Remove $z$ from $y$ and regress on the residual
res = residuals(lm(y~z))
out = lm(res~x)
summary(out)
The p-value for $x$ differs between these two analyses: 2.82e-05 vs 0.000126. Because $x$ and $z$ are correlated, regressing $y$ on $z$ alone absorbs part of the effect of $x$, so the subsequent regression of the residual on $x$ gives an attenuated estimate of $b$ (hence the larger p-value here).
Remove $z$ from both $x$ and $y$
As previously established, we can use the code below to remove $z$ from both $x$ and $y$:
z.with.intercept = cbind(1, z)                         # design matrix Z = [intercept, z]
A = crossprod(z.with.intercept)                        # Z'Z
szy = as.vector(solve(A, c(y %*% z.with.intercept)))   # coefficients from regressing y on Z
res.y = y - z.with.intercept %*% szy                   # y with z removed
szx = as.vector(solve(A, c(x %*% z.with.intercept)))   # coefficients from regressing x on Z
res.x = x - z.with.intercept %*% szx                   # x with z removed
out = lm(res.y~res.x)
summary(out)
Here the results are comparable with the previous multiple regression $y = xb + zw + e$: p-value 2.82e-05 vs 2.77e-05.
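This equivalence is the Frisch–Waugh–Lovell theorem: the coefficient of $x$ from the multiple regression equals, exactly, the coefficient from regressing the doubly residualized $y$ on the doubly residualized $x$; the p-values differ slightly only because the residual regression does not account for the degrees of freedom spent on the $z$ coefficients. A quick check (an addition, not in the original notebook):
coef(lm(y ~ x + z))["x"]                  # estimate of b from the multiple regression
coef(lm(drop(res.y) ~ drop(res.x)))[2]    # estimate from the doubly residualized regression; identical up to numerical error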
Which version is more justified, when $x$ and $z$ are associated?
We have established that multiple regression (equivalent to removing $z$ from both $x$ and $y$ and then regressing one residual on the other) differs from removing $z$ from $y$ only and then regressing that residual on $x$. The question is: which version is more justified?
I think it depends on our belief about the causal relationships. If both $y$ and $z$ are phenotypes and $x$ is the genotype, then there is no way $z$ can cause $x$, and $z$ is potentially a mediator. In that case, I think removing $z$ from $y$ but not from $x$ is justifiable. If $z$ is a set of PC scores from a PCA of the genotypes, then $z$ is indeed a confounder: it “causes” both the genotype $x$ (specific to an ancestry) and the phenotype $y$ (also specific to an ancestry). In that case, $z$ should be removed from both $x$ and $y$.
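As a small illustration of the confounder case (an addition, not in the original notebook; the effect sizes below are arbitrary), we can simulate $z$ as a cause of both $x$ and $y$: removing $z$ from $y$ alone gives an attenuated estimate of the conditional effect of $x$, while removing $z$ from both recovers it.
set.seed(2)
z2 = rnorm(n)
x2 = 0.5 * z2 + rnorm(n)                # z2 causes x2
y2 = 0.2 * x2 + 0.2 * z2 + rnorm(n)     # z2 also causes y2; conditional effect of x2 is 0.2
res.y2 = residuals(lm(y2 ~ z2))
res.x2 = residuals(lm(x2 ~ z2))
coef(lm(res.y2 ~ x2))[2]                # z2 removed from y2 only: attenuated
coef(lm(res.y2 ~ res.x2))[2]            # z2 removed from both: close to 0.2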