# Covariate adjustment in regression: direct vs indirect

This notebook shows the difference between a direct regression analysis of $y$ on $x$ that includes covariates $z$, versus a regression analysis on the residual of $y$ after removing the covariates $z$.

We assess these scenarios:

- There is an association between $x$ and $y$ and between $z$ and $y$, but no association between $x$ and $z$
- $x$ and $z$ are associated

## Scenario 1: $x$ and $z$ are not associated

### Data simulation

Here, $x$ and $z$ have the same effect size but are independent.

```
set.seed(1)
n = 500
x = rnorm(n)   # predictor of interest
z = rnorm(n)   # covariate, independent of x
bx = 0.2
bz = 0.2
y = x * bx + z * bz + rnorm(n)
```

### Multiple regression analysis

To fit the model $y = xb + zw + e$:

```
out = lm(y~x+z)
summary(out)
```

The p-value for $x$ is 0.000181.

### Regression with residual

Now we “remove” $z$ from $y$ first:

```
res = residuals(lm(y~z))
```

Then perform the regression analysis $y_{residual} = xb + e$:

```
out = lm(res~x)
summary(out)
```

The p-value for $x$ is 0.000181, the same as before.
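As a quick check of this equivalence, the sketch below (which regenerates the scenario-1 data with the same seed) compares the coefficient on $x$ from the two approaches; because $x$ and $z$ are nearly uncorrelated in the sample, the two estimates essentially coincide:

```
# Regenerate the scenario-1 data (same seed as above)
set.seed(1)
n = 500
x = rnorm(n)
z = rnorm(n)
y = 0.2 * x + 0.2 * z + rnorm(n)

# Coefficient on x from the multiple regression ...
b.multi = coef(lm(y ~ x + z))[["x"]]
# ... and from regressing the z-residuals of y on x
b.resid = coef(lm(residuals(lm(y ~ z)) ~ x))[["x"]]

# In-sample, b.resid = b.multi * (1 - R^2 of x on z), so the two
# agree whenever x and z are (nearly) uncorrelated
print(c(multiple = b.multi, residual = b.resid, cor.xz = cor(x, z)))
```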

Now, instead of using the `residuals` function in R, I use the code below to remove $z$ from $y$:

```
# Build the design matrix Z = [1, z] and form the normal equations
z.with.intercept = cbind(1, z)
A = crossprod(z.with.intercept)                       # Z'Z
# Solve (Z'Z) szy = Z'y for the coefficients, then residualize
szy = as.vector(solve(A, c(y %*% z.with.intercept)))
res = y - z.with.intercept %*% szy
```

```
out = lm(res~x)
summary(out)
```

The result is the same as using the `residuals` function in R.
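In matrix notation, the manual code above implements the usual least-squares projection. With $Z = [\mathbf{1}, z]$,

$$\hat\gamma = (Z^\top Z)^{-1} Z^\top y, \qquad y_{residual} = y - Z\hat\gamma = \big(I - Z(Z^\top Z)^{-1}Z^\top\big)\, y,$$

so `A` is $Z^\top Z$, `szy` is $\hat\gamma$, and `res` is $y_{residual}$.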

## Scenario 2: $x$ and $z$ are associated

This is a situation of multicollinearity, which causes an imprecise estimate of the effect of $x$ on $y$ conditional on $z$. To illustrate:

### Data simulation

Notice that here the association between $x$ and $z$ is strong (even stronger than that between $x$ and $y$).

```
set.seed(1)
n = 500
x = rnorm(n)
bxz = 0.5
z = x * bxz + rnorm(n)   # z now depends on x
bx = 0.2
bz = 0.2
y = x * bx + z * bz + rnorm(n)
```

### Multiple regression analysis

To fit the model $y = xb + zw + e$:

```
out = lm(y~x+z)
summary(out)
```

### Remove $z$ from $y$ and regress with residual

```
res = residuals(lm(y~z))
out = lm(res~x)
summary(out)
```

The p-value for $x$ differs between these two analyses: 2.82e-05 vs 0.000126.
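The discrepancy is not random noise: in-sample, the slope from the residual-$y$ regression equals the multiple-regression slope scaled by $1 - R^2$ from regressing $x$ on $z$, so removing $z$ from $y$ alone attenuates the estimate whenever $x$ and $z$ are correlated. A sketch, regenerating the scenario-2 data with the same seed:

```
set.seed(1)
n = 500
x = rnorm(n)
z = 0.5 * x + rnorm(n)        # z is correlated with x
y = 0.2 * x + 0.2 * z + rnorm(n)

b.multi = coef(lm(y ~ x + z))[["x"]]                  # conditional effect of x
b.resid = coef(lm(residuals(lm(y ~ z)) ~ x))[["x"]]   # attenuated estimate
r2.xz   = summary(lm(x ~ z))$r.squared

# b.resid equals b.multi scaled by (1 - R^2 of x on z)
print(c(b.multi = b.multi, b.resid = b.resid, scaled = b.multi * (1 - r2.xz)))
```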

### Remove $z$ from both $x$ and $y$

Using the same normal-equations approach as before, we can remove $z$ from both $x$ and $y$:

```
z.with.intercept = cbind(1, z)
A = crossprod(z.with.intercept)
szy = as.vector(solve(A,c(y %*% z.with.intercept)))
res.y = y - z.with.intercept %*% szy
szx = as.vector(solve(A,c(x %*% z.with.intercept)))
res.x = x - z.with.intercept %*% szx
```

```
out = lm(res.y~res.x)
summary(out)
```

Here the results agree with the earlier multiple-regression analysis of $y = xb + zw + e$: p-value 2.82e-05 vs 2.77e-05.
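This agreement is an instance of the Frisch–Waugh–Lovell theorem: writing $M_Z = I - Z(Z^\top Z)^{-1}Z^\top$ for the residual-maker of $Z = [\mathbf{1}, z]$, the coefficient on $x$ in the full regression $y = xb + zw + e$ satisfies

$$\hat b = \frac{(M_Z x)^\top (M_Z y)}{(M_Z x)^\top (M_Z x)},$$

which is exactly the slope from regressing the $z$-residuals of $y$ on the $z$-residuals of $x$. The point estimates are identical; the p-values differ slightly only because the residual-on-residual regression does not account for the degrees of freedom spent removing $z$.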

## Which version is more justified, when $x$ and $z$ are associated?

We have established that there is a difference between multiple regression (equivalent to removing $z$ from both $x$ and $y$ and then regressing the residuals) and removing $z$ from $y$ only before regressing on $x$. The question is: which version is more justified?

I think it depends on our beliefs about the causal relationships. If both $y$ and $z$ are phenotypes and $x$ is the genotype, then there is no way $z$ can cause $x$, and $z$ is potentially a mediator; in that case, I think removing $z$ from $y$ but not from $x$ is justifiable. If $z$ consists of PC scores from a PCA on the genotypes, then $z$ is indeed a confounder: it “causes” both the genotype $x$ (specific to an ancestry) and the phenotype $y$ (also specific to an ancestry). In that case, $z$ should be removed from both $x$ and $y$.