# An illustration of permutation testing in GWAS

As a toy example, consider a genetic association study with 10 SNPs, each weakly associated with a quantitative trait. We would like to assess the significance of these SNPs after multiple-testing correction.

## 10 SNPs are independent

Let’s first simulate the genotype data, assuming independent SNPs and a sample of $N = 1{,}000$ unrelated individuals:

```r
# set random seed
set.seed(999)
# sample the SNP matrix X from a multivariate normal distribution,
# as if the genotypes have been standardized
n <- 1000
p <- 10
R <- diag(p)
mu <- rep(0, p)
X <- MASS::mvrnorm(n, mu = mu, Sigma = R)
```



Now we simulate the 10 SNP effects, assuming 0.5% heritability for each SNP:

```r
sigma2 <- 0.005
b <- rnorm(p, 0, sqrt(sigma2))
```



then simulate the phenotype vector $y$ as $y = Xb + e$, where the elements of $e$ are i.i.d. $N(0,\, 1-\sigma^2)$ (standard deviation $\sqrt{1-\sigma^2}$):

```r
# note the noise vector must have length n (one draw per individual), not p
y <- X %*% b + rnorm(n, 0, sqrt(1 - sigma2))
```
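As a quick sanity check on the simulation, we can estimate the realized heritability as the variance of the genetic component over the total phenotypic variance. This is a self-contained sketch that re-simulates under the same model, using plain `rnorm` in place of `MASS::mvrnorm` (equivalent here since the correlation matrix is the identity); `X2`, `b2`, `y2` are fresh names so the block does not overwrite the variables above.

```r
# sanity check: re-simulate under the same model and estimate the realized
# heritability as var(Xb) / var(y); in expectation this is p * sigma2 = 0.05
n <- 1000; p <- 10; sigma2 <- 0.005
X2 <- matrix(rnorm(n * p), n, p)          # independent standardized SNPs
b2 <- rnorm(p, 0, sqrt(sigma2))
y2 <- X2 %*% b2 + rnorm(n, 0, sqrt(1 - sigma2))
as.numeric(var(X2 %*% b2) / var(y2))      # fluctuates around 0.05 across replicates
```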



Now perform the association analysis:

```r
test_assoc <- function(X, y) {
  # per-SNP simple regression via susieR; two-sided p-value from the z-score
  res <- susieR::univariate_regression(X, y)
  p <- pnorm(res$betahat / res$sebetahat)
  return(pmin(p, 1 - p) * 2)
}

p_orig <- test_assoc(X, y)
p_orig
```


```
0.00244222209530038   0.155430852452004    0.455153232198785
0.111110083396467     0.111854162419335    0.0103712166234653
0.514298813938308     0.644904338120907    2.94489178997992e-06
0.741834184053375
```
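As a cross-check, the same analytical p-values can be reproduced with base R alone. This is a sketch assuming `susieR::univariate_regression` fits one simple linear regression per SNP; we compute the p-value from the normal distribution of the z-score, as in `test_assoc` (with $n = 1{,}000$ this is nearly identical to the t-based p-value that `summary.lm` reports).

```r
# hedged base-R cross-check: one simple regression per SNP column
test_assoc_lm <- function(X, y) {
  apply(X, 2, function(x) {
    coefs <- summary(lm(y ~ x))$coefficients
    z <- coefs["x", "Estimate"] / coefs["x", "Std. Error"]
    2 * pnorm(-abs(z))  # two-sided normal p-value, matching test_assoc
  })
}
```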

## Permutation testing scheme 1 – equivalence to analytical p-value

Here we perform permutation testing on the data as follows:

1. Permute the phenotype vector
2. Perform association tests and get the permuted p-value for each SNP
3. Compare the permuted p-value with the original p-value for each SNP. If the permuted p-value is smaller than the original p-value we call it a “success”, otherwise a “failure”
4. Repeat steps 1–3 many times. The empirical p-value is the total number of successes divided by the total number of permutations

This scheme should yield p-values similar to the original analytical p-values. Let’s use 10,000 permutations for this problem:

```r
test_perm_1 <- function(X, y, K = 10000) {
  p_orig <- test_assoc(X, y)
  res <- matrix(0, K, ncol(X))
  for (i in 1:K) {
    # permute the phenotype and rerun the association tests
    p_perm <- test_assoc(X, sample(y))
    res[i, ] <- p_orig >= p_perm
  }
  return(apply(res, 2, mean))
}

p_perm_1 <- test_perm_1(X, y)
```


```r
plot(p_orig, p_perm_1)
```

In this case the permutation and analytical p-values agree closely.
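How closely should they agree? An empirical p-value from $K$ permutations is a binomial proportion, so its Monte Carlo standard error is $\sqrt{p(1-p)/K}$. A minimal sketch:

```r
# Monte Carlo standard error of an empirical p-value estimated from K permutations
perm_se <- function(p, K) sqrt(p * (1 - p) / K)

# e.g. a true p-value of 0.01 estimated with K = 10000 permutations
perm_se(0.01, 10000)  # about 0.001
```

So with $K = 10{,}000$, p-values near 0.01 are pinned down to roughly $\pm 0.002$, but p-values much smaller than $1/K$ cannot be resolved at all.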

## Permutation testing scheme 2 – minimum p-value

Here we perform permutation testing on the data as follows:

1. Permute the phenotype vector
2. Perform association tests and get the minimum p-value among the 10 permuted tests
3. Compare this minimum p-value with each of the 10 SNPs’ original p-values. If the minimum p-value is smaller than the original p-value we call it a “success”, otherwise a “failure”
4. Repeat steps 1–3 many times. The empirical p-value is the total number of successes divided by the total number of permutations

Let’s use 10,000 permutations for this problem:

```r
test_perm_2 <- function(X, y, K = 10000) {
  p_orig <- test_assoc(X, y)
  res <- matrix(0, K, ncol(X))
  for (i in 1:K) {
    # permute the phenotype, keep only the minimum p-value across SNPs
    p_perm <- test_assoc(X, sample(y))
    min_p <- min(p_perm)
    res[i, ] <- p_orig >= min_p
  }
  return(apply(res, 2, mean))
}

p_perm_2 <- test_perm_2(X, y)
```



If this permutation scheme is correct, we would expect it to be comparable to a Bonferroni-style correction for testing 10 independent hypotheses. (Strictly speaking, $1-(1-p)^m$ is the Šidák correction, which is exact for $m$ independent tests and nearly identical to Bonferroni’s $mp$ when $p$ is small.)

```r
bonferroni_correct <- function(p) 1 - (1 - p) ^ length(p)

bonferroni_correct(p_orig)
```


```
0.0241555612908793   0.815348804296027   0.997694613271454
0.692050291915425    0.694618413064428   0.0990033365759982
0.999269379859342    0.999968124745087   2.94485276454148e-05
0.999998684813438
```
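To see how close the two forms of correction are, here is a sketch comparing $1-(1-p)^m$ with the classic Bonferroni bound $\min(1, mp)$ at a few small p-values (illustrative values only):

```r
# compare the Sidak-style correction 1 - (1-p)^m with classic Bonferroni m*p
p <- c(0.001, 0.01, 0.05)
m <- 10
sidak <- 1 - (1 - p)^m
bonf  <- pmin(1, m * p)
round(cbind(p, sidak, bonf), 4)
```

The Šidák value is always slightly smaller (less conservative), and the gap vanishes as $p \to 0$.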
```r
plot(bonferroni_correct(p_orig), p_perm_2)
p_perm_2
```


```
0.0249   0.8174   0.9974   0.6927   0.6945
0.0998   0.9989   0.9999   0        1
```

The Bonferroni correction and this permutation scheme give essentially the same result because, by definition, both assess whether the most significant association is real. By using the minimum p-value, the permutation test estimates the empirical distribution of the most extreme test statistic.
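This connection can be checked directly: for $m$ independent tests under the null, the p-values are Uniform(0, 1), so the minimum p-value has CDF $1-(1-x)^m$, exactly the formula used above. A minimal simulation sketch:

```r
# empirical CDF of the minimum of m independent uniform p-values
# versus the closed form 1 - (1 - x)^m
set.seed(1)
m <- 10
min_p <- replicate(1e5, min(runif(m)))
x <- 0.05
c(empirical = mean(min_p <= x), theoretical = 1 - (1 - x)^m)
```

The two quantities agree up to Monte Carlo error, which is why correcting each SNP’s p-value against the permuted minimum reproduces the Bonferroni-style adjustment when the SNPs are independent.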