The Evolution and Impact of Genotype Imputation Methods in GWAS
Genotype imputation is a key component of genetic association studies: it increases statistical power, facilitates meta-analysis, and aids in interpreting association signals. It is the process of inferring untyped genotypes from observed genotype data, and it is widely used in genome-wide association studies (GWAS). Imputation works because nearby genetic markers tend to be inherited together, a property known as linkage disequilibrium (LD). In effect, genotype imputation increases the effective sample size, improving the statistical power to detect disease variants with moderate effects.
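The linkage disequilibrium idea can be illustrated with a toy sketch. This is not the algorithm of any particular imputation tool; it assumes a tiny, made-up reference panel of phased haplotypes (alleles coded 0/1) and a hypothetical helper, impute_allele, that estimates an untyped allele as the average of that allele among reference haplotypes agreeing with the study haplotype at the typed markers.

```python
from typing import Optional, Sequence


def impute_allele(
    reference: Sequence[Sequence[int]],
    study: Sequence[Optional[int]],
    target: int,
) -> Optional[float]:
    """Estimate the allele at index `target` of a study haplotype.

    `reference` holds fully typed haplotypes (0/1 alleles). `study` covers the
    same markers, with None where the allele was not typed. The estimate is the
    mean target allele among reference haplotypes that agree with the study
    haplotype at every typed marker. Returns None if nothing matches.
    """
    typed = [i for i, a in enumerate(study) if a is not None and i != target]
    matches = [h for h in reference if all(h[i] == study[i] for i in typed)]
    if not matches:
        return None
    return sum(h[target] for h in matches) / len(matches)


# Made-up reference panel: the middle marker is in strong LD with its neighbors.
reference_panel = [
    [0, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
    [1, 1, 1],
    [1, 0, 1],
]

# Study haplotype typed at the first and last marker only.
study_haplotype = [1, None, 1]
estimate = impute_allele(reference_panel, study_haplotype, target=1)
print(estimate)
```

Three reference haplotypes match the study haplotype at the typed markers, and two of them carry allele 1 at the untyped marker, so the estimate is 2/3. Real methods replace exact matching with probabilistic models that allow for recombination and genotyping error.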
Algorithms in Genetic Imputation
The two main categories of genotype imputation strategies are population-based methods and family-based methods.
Population-Based Methods: These use linkage disequilibrium information from a reference panel of subjects with complete observations on a comprehensive set of single nucleotide polymorphisms (SNPs). One such method is IMPUTE2, which uses both reference and study samples for haplotype phasing at observed markers. SHAPEIT2 duoHMM extends this approach to family-based studies by incorporating a hidden Markov model for identity by descent (IBD); the resulting phased haplotypes are then used in an IMPUTE2 analysis.
Family-Based Methods: These methods use inheritance information within pedigrees. For instance, MERLIN uses pedigree structure to identify inheritance vectors within a family, then propagates genotypes at high-density markers observed in some individuals to others in the pedigree. Another method, GIGI (Genotype Imputation Given Inheritance), employs a two-stage procedure that infers inheritance vectors at sparse markers, followed by Markov chain Monte Carlo (MCMC) sampling to estimate genotypes at a dense marker set.
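The propagation step can be sketched for a single parent-child trio. The code below is a simplified illustration under strong assumptions, not the MERLIN or GIGI implementation: the parents' phased haplotypes at the dense markers and the child's inheritance vector are taken as known, whereas real tools infer inheritance vectors probabilistically from sparse markers. The function child_genotypes and all data are hypothetical.

```python
from typing import List, Tuple

# A parent's two phased haplotypes over the same markers (alleles coded 0/1).
Haplotypes = Tuple[List[int], List[int]]


def child_genotypes(
    father: Haplotypes,
    mother: Haplotypes,
    paternal_choice: List[int],
    maternal_choice: List[int],
) -> List[int]:
    """Return the child's genotype (count of allele 1) at each marker.

    paternal_choice[m] and maternal_choice[m] are 0 or 1 and indicate which
    of the parent's two haplotypes the child inherited at marker m. Together
    they play the role of the child's inheritance vector.
    """
    genotypes = []
    for m, (p, q) in enumerate(zip(paternal_choice, maternal_choice)):
        genotypes.append(father[p][m] + mother[q][m])
    return genotypes


# Made-up dense-marker haplotypes for the two sequenced parents.
father = ([0, 1, 1, 0], [1, 0, 0, 0])
mother = ([0, 0, 1, 1], [1, 1, 0, 0])

# Assumed inheritance: father's haplotype 0 throughout; mother's haplotype 1
# for the first two markers, then haplotype 0 after a recombination.
paternal = [0, 0, 0, 0]
maternal = [1, 1, 0, 0]

imputed = child_genotypes(father, mother, paternal, maternal)
print(imputed)
```

Because the inheritance vector fixes which parental haplotype the child carries at each position, the child's dense genotypes follow directly from the parents' data, even though the child was never typed at those markers.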
Importance of Genetic Imputation
Genotype imputation is vital for understanding the genetic basis of diseases and traits. It allows researchers to:
Study genetic variants that were not directly genotyped in a study, which is particularly useful for rare variants.
Combine data from different studies, increasing the effective sample size and power of genetic analyses.
Interpret genetic associations in the context of the broader genetic architecture of a trait or disease.
Efficiency of Imputation Strategies
The efficiency of an imputation strategy depends on factors such as the size of the dataset, the genetic structure of the population, and the computational resources available. For example, IMPUTE2 with its pre-phasing strategy (SHAPEIT-IMPUTE2) is known to be more effective for larger datasets, whereas for smaller datasets other methods such as MaCH might be superior. Consistent with this, the efficiency of pre-phasing strategies tends to decrease with smaller sample sizes.
Comparison of Familial and Population-Based Imputation
Family-based imputation (FBI), particularly with methods like MERLIN and GIGI, generally provides better imputation quality when there are informative family members with dense sequence data available. In contrast, population-based imputation (PBI), such as IMPUTE2, is more effective for common variants, especially when using a closely related reference panel. However, PBI has limitations in explaining all genetic variation associated with diseases and risk factors, particularly for less frequent variants.
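The text above does not say how imputation quality is measured. One common choice, assumed here for illustration only, is the squared Pearson correlation between imputed allele dosages and the true genotypes at variants where the truth is known. The sketch below uses a hypothetical function, dosage_r2, and made-up values.

```python
from typing import Sequence


def dosage_r2(
    true_genotypes: Sequence[float],
    imputed_dosages: Sequence[float],
) -> float:
    """Squared Pearson correlation between true genotypes (0/1/2) and
    imputed dosages (expected allele counts between 0 and 2)."""
    n = len(true_genotypes)
    if n != len(imputed_dosages) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum(
        (t - mean_t) * (d - mean_d)
        for t, d in zip(true_genotypes, imputed_dosages)
    )
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        raise ValueError("correlation is undefined when either input is constant")
    return (cov * cov) / (var_t * var_d)


# Made-up example: the same four individuals imputed by two hypothetical methods.
truth = [0, 1, 2, 1]
method_a = [0.1, 1.0, 1.9, 1.0]  # dosages close to the truth
method_b = [0.2, 0.9, 1.1, 1.8]  # dosages further from the truth

r2_a = dosage_r2(truth, method_a)
r2_b = dosage_r2(truth, method_b)
print(r2_a > r2_b)
```

Comparing such a score per variant, and separately for common and rare variants, is one way to make the family-based versus population-based comparison concrete.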
GIGI2: An Advanced Family-Based Imputation Method
GIGI2, an advanced version of the GIGI method, significantly reduces computational time and memory usage, making it more efficient for imputing genotypes in large pedigrees. This improvement is particularly useful in family-based GWAS, where it can detect rare variants associated with complex traits more efficiently.
Conclusion
Genetic imputation is a crucial tool in genetic research, particularly in GWAS, as it enables the analysis of untyped genetic variants and enhances the power of genetic association studies. Both familial and population-based methods have their unique strengths and are chosen based on the specific requirements of the study, such as the availability of family data, the size of the dataset, and the types of variants of interest.
References:
Das, S., Forer, L., Schönherr, S., Sidore, C., Locke, A. E., Kwong, A., ... & Fuchsberger, C. (2016). Next-generation genotype imputation service and methods. Nature Genetics, 48(10), 1284-1287.
Liu, C. T., Deng, X., Fisher, V., Heard-Costa, N., Xu, H., Zhou, Y., ... & Cupples, L. A. (2019). Revisit population-based and family-based genotype imputation. Scientific Reports, 9(1), 1800.
Ullah, E., Kunji, K., Wijsman, E. M., & Saad, M. (2019). GIGI2: A Fast Approach for Parallel Genotype Imputation in Large Pedigrees. bioRxiv, 533687.