Allele flip when merging genotype data
Problem
Suppose we have some genetic variants measured in two sets of individuals and we would like to merge these genotype data.
Ideally, a variant, given by chrom
, position
, reference allele
, alternative allele
, would easily merge if information between
two data-sets are consistent. That is, the variant has the same 4 quantities for the different data-sets. However in reality there are
a few problems:
-
Scenario 1: When genotype data are given in PLINK format, there is no information on which allele is the reference and which is alternative allele. Instead all you get is “major” and “minor” allele, which is determined by the samples in the data-set. For example suppose a variant is:
1:124235:A:G # chrom:pos:ref:alt
and in the first data-set 60% alleles are
A
, whereas in the 2nd data-set only 45% areA
. Then in the PLINKbim
file the major allele will be A for data-set 1, and G for data-set 2. Even though in this case the two variants will appear1:124235:A:G # chrom:pos:major:minor
and
1:124235:G:A # chrom:pos:major:minor
in different data-sets, they are still the same variant. Software such as PLINK will handle these major/minor allele inconsistency by internally flipping genotype coding when the variants are merged, as long as we make the variant ID the same for the two data-sets in PLINK
bim
file (via some simple programming) so PLINK knows they are the same variant. -
Scenario 2: Sometimes a variant, measured on different strand due to differences in genotyping platform, can have different allele representations. For example in the first data-set:
1:124235:A:G
and the second:
1:124235:T:C
but in fact these two variants are the same because alleles A/G on their complementary DNA strand is
T/C
. In practise what we do is to change the allele representation in the 2nd data to1:124235:A:G
to match and be ready to merge with the first data-set. This is called strand flip.
Notice that the 2nd data-set may have
1:124235:C:T
which flips to1:124235:G:A
, and falls in scenario #1 the major minor allele inconsistency which is straightforward to iron out with tools like PLINK. -
Scenario 3: Complication arise when it comes to specificly A/T and C/G genotypes. Suppose a variant in the first data-set:
1:124236:A:T
and the second:
1:124236:T:A
If it is a strand flip, the 2nd data-set on the complementary strand is
1:124236:A:T
which can be directly merged with the first data-set. However, it can also be a major/minor allele inconsistency in which case we do not flip the strand but will rely on PLINK to flip the allele coding while it merges the data. Since we cannot tell one situation from another, and the genotypes in these two situations will be handled differently, in practice we simply remove variants with A/T alleles from consideration.
Same holds for C/G — you can work that out yourself to verify
-
Tri-allelic case, such as one data-set has
A:C
the other hasA:G
, orC:G
etc.
Implementation of a solution
The R code below is extracted from FUSION script by Hao Sun in our group:
allele.qc = function(a1,a2,ref1,ref2) {
# a1 and a2 are the first data-set
# ref1 and ref2 are the 2nd data-set
# Make all the alleles into upper-case, as A,T,C,G:
a1 = toupper(a1)
a2 = toupper(a2)
ref1 = toupper(ref1)
ref2 = toupper(ref2)
# Strand flip, to change the allele representation in the 2nd data-set
strand_flip = function(ref) {
flip = ref
flip[ref == "A"] = "T"
flip[ref == "T"] = "A"
flip[ref == "G"] = "C"
flip[ref == "C"] = "G"
flip
}
flip1 = strand_flip(ref1)
flip2 = strand_flip(ref2)
snp = list()
# Remove strand ambiguous SNPs (scenario 3)
snp[["keep"]] = !((a1=="A" & a2=="T") | (a1=="T" & a2=="A") | (a1=="C" & a2=="G") | (a1=="G" & a2=="C"))
# Remove non-ATCG coding
snp[["keep"]][ a1 != "A" & a1 != "T" & a1 != "G" & a1 != "C" ] = F
snp[["keep"]][ a2 != "A" & a2 != "T" & a2 != "G" & a2 != "C" ] = F
# as long as scenario 1 is involved, sign_flip will return TRUE
snp[["sign_flip"]] = (a1 == ref2 & a2 == ref1) | (a1 == flip2 & a2 == flip1)
# as long as scenario 2 is involved, strand_flip will return TRUE
snp[["strand_flip"]] = (a1 == flip1 & a2 == flip2) | (a1 == flip2 & a2 == flip1)
# remove other cases, eg, tri-allelic, one dataset is A C, the other is A G, for example.
exact_match = (a1 == ref1 & a2 == ref2)
snp[["keep"]][!(exact_match | snp[["sign_flip"]] | snp[["strand_flip"]])] = F
return(snp)
}
Each input a1
, a2
, ref1
, ref2
is a list of characters of {ATCG}
.
Example usage 1: merge summary statistics data with reference genotypes
In this application, strand_flip
does not matter and only sign_flip
matters. sign_flip
is major/minor allele flip, and matters to the sign of z-scores
qc = allele.qc( sumstat$A1 , sumstat$A2 , genos$bim[,5] , genos$bim[,6] )
# Flip Z-scores for mismatching alleles
sumstat$Z[ qc$sign_flip ] = -1 * sumstat$Z[ qc$sign_flip ]
sumstat$A1[ qc$sign_flip ] = genos$bim[qc$sign_flip,5]
sumstat$A2[ qc$sign_flip ] = genos$bim[qc$sign_flip,6]
# Remove strand ambiguous SNPs (if any)
if ( sum(!qc$keep) > 0 ) {
genos$bim = genos$bim[qc$keep,]
genos$bed = genos$bed[,qc$keep]
sumstat = sumstat[qc$keep,]
}
Example usage 2: merge two genotype data-set
When two data-sets are merged with PLINK, sign_flip
(the major/minor allele flips) does not matter because PLINK will handle it internally. Only strand_flip
matters and we have to fix the alleles for those that need a strand flip.
Suppose we have 12 same variants in 2 data-sets:
Data-set one:
1 22 rs9617549 0 10874444 C T
2 22 rs5747224 0 11122151 G A
3 22 rs2456393 0 11122417 G T
4 22 rs9617550 0 10874444 A G
5 22 rs5747225 0 11122151 T C
6 22 rs2456394 0 11122417 A C
7 22 rs199668908 0 11123683 A G
8 22 rs199674018 0 11125296 A T
9 22 rs538546296 0 15282439 C G
10 22 rs551794980 0 15283826 T A
11 22 rsxxxxxxxxx 0 15284000 A C
12 22 rsyyyyyyyyy 0 15285000 A C
Data-set two:
1 22 rs9617549 0 10874444 T C
2 22 rs5747224 0 11122151 A G
3 22 rs2456393 0 11122417 T G
4 22 rs9617550 0 10874444 T C
5 22 rs5747225 0 11122151 A G
6 22 rs2456394 0 11122417 T G
7 22 rs199668908 0 11123683 C T
8 22 rs199674018 0 11125296 T A
9 22 rs538546296 0 15282439 G C
10 22 rs551794980 0 15283826 T A
11 22 rsxxxxxxxxx 0 15284000 A G
12 22 rsyyyyyyyyy 0 15285000 C G
We can tell the variants are the same by chr_pos
. For example the first variant: is from chromosome 22, and the start position is 10874444. Sometimes the variant ID (e.g.rs9617549) is not provided.
For the 12 lines (examples) of the above 2 data-sets:
Line 1-3: Scenario 1
Line 4-6: Scenario 2
Line 7: The combination of scenario 1 and scenario 2.
(Falls into scenario 1, after strand flip)
Line 8-10: Scenario 3
Line 11-12: Trialleic cases We need to merge these 2 datasets together by `chr_pos`, then apply `allele.qc` onto the merged data.
The data_merge
should look like this:
chr_pos V1.x V2.x V3.x V4.x V5.x V6.x V1.y V2.y V3.y V4.y V5.y V6.y
1 22:rs9617549 22 rs9617549 0 10874444 C T 22 rs9617549 0 10874444 T C
2 22:rs5747224 22 rs5747224 0 11122151 G A 22 rs5747224 0 11122151 A G
3 22:rs2456393 22 rs2456393 0 11122417 G T 22 rs2456393 0 11122417 T G
4 22:rs9617550 22 rs9617549 0 10874444 A G 22 rs9617549 0 10874444 T C
5 22:rs5747225 22 rs5747224 0 11122151 T C 22 rs5747224 0 11122151 A G
6 22:rs2456394 22 rs2456393 0 11122417 A C 22 rs2456393 0 11122417 T G
7 22:rs199668908 22 rs199668908 0 11123683 A G 22 rs199668908 0 11123683 C T
8 22:rs199674018 22 rs199674018 0 11125296 A T 22 rs199674018 0 11125296 T A
9 22:rs538546296 22 rs538546296 0 15282439 C G 22 rs538546296 0 15282439 G C
10 22:rs551794980 22 rs551794980 0 15283826 T A 22 rs551794980 0 15283826 T A
11 22:rsxxxxxxxxx 22 rsxxxxxxxxx 0 15284000 A C 22 rsxxxxxxxxx 0 15284000 A G
12 22:rsyyyyyyyyy 22 rsyyyyyyyyy 0 15285000 A C 22 rsyyyyyyyyy 0 15285000 C G
And we then apply allele.qc
as:
qc = allele.qc(data_merge[,'V5.x'] , data_merge[,'V6.x'] , data_merge[,'V5.y'], data_merge[,'V6.y'])
The interface of this merger, and how it works in detail can be found Here.
The results of testing the 12 SNPs are as expected
-
keep
would return FALSE for scenario 3 and triallelic cases, i.e. the last 5 examples. -
sign_flip
identifies those cases where scenario 1 is involved, i.e. TRUE for the first 3 variantes (scenario 1 only), the 7th variantes (combination of scanrio 1 and scenario 2), and scenario 3, lines 8-10. -
strand_flip
identifies those cases where scenario 2 is involved, i.e. TRUE for scenario 2 only (lines 4-6), the 7th variantes (combination of scanrio 1 and scenario 2), and scenario 3, line 8-10.