The Role of Population Stratification in GWAS
Population stratification is a fundamental consideration in genetic association studies, especially genome-wide association studies (GWAS). The term refers to differences in allele frequencies between cases and controls that arise from systematic ancestry differences rather than from an association between genes and disease. If not appropriately corrected, these differences can produce spurious associations in disease studies.
Importance of Population Stratification
When cases and controls differ systematically in ancestry, any marker whose frequency varies across ancestral populations can appear associated with disease, producing a false positive. Stratification can exist even in well-designed studies, and modest amounts of it can affect the results significantly. Freedman et al. (2004) showed that assessments based on a few dozen markers lack the power to rule out moderate levels of stratification, which could cause false-positive associations in studies designed to detect modest genetic risk factors.
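As a worked illustration of how such a false positive arises, the following minimal sketch builds expected allele counts for two hypothetical subpopulations, A and B. The allele frequencies and sample sizes are invented for illustration and do not come from any study. Within each subpopulation the allele is equally common in cases and controls, yet the pooled comparison yields a large chi-square statistic because cases are drawn mostly from A and controls mostly from B.

```python
def allele_chi2(case_alt, case_total, ctrl_alt, ctrl_total):
    """Pearson chi-square (1 df) for a 2x2 table of allele counts."""
    a, b = case_alt, case_total - case_alt
    c, d = ctrl_alt, ctrl_total - ctrl_alt
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom


def allele_counts(freq, n_people):
    """Expected (alternate allele count, total alleles) for diploid subjects."""
    total = 2 * n_people
    return round(total * freq), total


# Hypothetical strata: (allele frequency, number of cases, number of controls).
# The allele has no effect on disease in either stratum.
strata = {"A": (0.1, 700, 300), "B": (0.5, 300, 700)}

within = {}
pooled = [0, 0, 0, 0]  # case_alt, case_total, ctrl_alt, ctrl_total
for name, (freq, n_cases, n_ctrls) in strata.items():
    case_alt, case_total = allele_counts(freq, n_cases)
    ctrl_alt, ctrl_total = allele_counts(freq, n_ctrls)
    within[name] = allele_chi2(case_alt, case_total, ctrl_alt, ctrl_total)
    for i, value in enumerate((case_alt, case_total, ctrl_alt, ctrl_total)):
        pooled[i] += value

pooled_chi2 = allele_chi2(*pooled)
print("within-stratum chi-square:", within)
print("pooled chi-square:", round(pooled_chi2, 2))
```

The within-stratum statistics are zero by construction, so the entire pooled signal reflects the ancestry imbalance between cases and controls.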
Methods to Detect and Correct Stratification
Genomic Control: This method is conceptually simple and involves examining the distribution of association statistics for unlinked genetic variants typed in cases and controls. The statistic at a candidate allele being tested for association is compared with the genome-wide distribution of statistics for markers that are probably unrelated to the disease, to assess whether the candidate allele stands out. In the absence of stratification, the association statistics for unlinked variants should follow a χ² distribution with 1 degree of freedom. In the presence of stratification, the distribution of association statistics is inflated by a factor termed λ, which becomes larger with increasing sample size.
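The inflation factor can be estimated in a few lines. The sketch below assumes the commonly used median-based estimator, in which λ is the median of the observed null-marker statistics divided by the median of a χ² distribution with 1 degree of freedom (approximately 0.4549). The text above does not specify an estimator, so this choice is an assumption, and the statistics shown are hypothetical values.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(null_stats):
    """Median-based estimate of lambda from null-marker chi-square statistics."""
    return statistics.median(null_stats) / CHI2_1DF_MEDIAN


def gc_adjust(stat, lam):
    """Deflate a candidate statistic; lambda below 1 is treated as 1."""
    return stat / max(lam, 1.0)


# Hypothetical 1-df statistics at markers presumed unrelated to disease.
null_stats = [0.05, 0.21, 0.46, 0.68, 0.91, 1.37, 2.30, 3.10, 5.02]

lam = genomic_inflation(null_stats)
adjusted = gc_adjust(12.0, lam)  # 12.0 is a hypothetical candidate statistic
print(f"lambda = {lam:.3f}, adjusted statistic = {adjusted:.3f}")
```

Dividing by λ applies the same correction to every marker, which is why the approach is simple but cannot account for markers whose frequencies differ unusually strongly across ancestral populations.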
Principal Components Analysis (PCA): PCA is used to explicitly model ancestry differences between cases and controls. The resulting correction is specific to a candidate marker's variation in frequency across ancestral populations, minimizing spurious associations while maximizing power to detect true associations. PCA can be applied to disease studies with hundreds of thousands of markers.
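A minimal sketch of how ancestry axes can be extracted from genotype data follows. It assumes genotypes coded as 0, 1, or 2 copies of an allele and scales each SNP by the square root of p(1 - p), a normalization in the spirit of Price et al. (2006) that here uses a plain sample frequency estimate. The function name top_pcs and the simulated data are hypothetical, and this is not the EIGENSTRAT implementation.

```python
import numpy as np


def top_pcs(genotypes, n_pcs=2):
    """Return the top principal component scores for a subjects x SNPs matrix."""
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0          # sample allele frequency per SNP
    keep = (p > 0) & (p < 1)          # drop monomorphic SNPs
    g, p = g[:, keep], p[keep]
    z = (g - 2 * p) / np.sqrt(p * (1 - p))   # center and scale each SNP
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]   # one row of PC scores per subject


# Simulated example: two hypothetical populations of 50 subjects, 200 SNPs.
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.1, 0.9, size=200)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, size=200), 0.05, 0.95)
geno = np.vstack([
    rng.binomial(2, freq_a, size=(50, 200)),
    rng.binomial(2, freq_b, size=(50, 200)),
])

pcs = top_pcs(geno, n_pcs=2)
print("mean PC1, population A:", pcs[:50, 0].mean())
print("mean PC1, population B:", pcs[50:, 0].mean())
```

In this simulation the first component separates the two populations, and it is scores of this kind that are used as covariates in the association test.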
STRUCTURE and Genomic Control Combination: STRUCTURE clusters samples based on multilocus genotypes to identify individuals with different ancestries, and the inferred ancestry is then included as a covariate in the association analysis. Genomic Control instead makes a quantitative estimate of the degree of stratification and uses it to adjust the association statistics. The two methods are not mutually exclusive and can be used in combination to improve the accuracy of stratification correction.
Methodology Steps for Population Stratification
The steps for correcting for population stratification with a robust PCA-based approach, as described by Liu et al. (2013), typically involve:
Data Preparation: Organize the SNP data as a matrix with rows representing subjects and columns representing SNPs.
Outlier Detection: Identify subject outliers using robust PCA approaches, such as the GRID algorithm or the resampling by half means (RHM) approach.
Principal Components Analysis: Perform regular PCA on the SNP data matrix after removing the subject outliers. Select several top PCs.
Clustering and Assignment: Apply k-medoids clustering to the selected PCs. Determine the optimal number of clusters based on Gap statistics and assign each subject to a cluster.
Association Testing: Test each SNP’s association with the outcome of interest by building a logistic regression model. This model should include the specific SNP as one factor, the selected PCs as covariates, and the cluster membership indicators as additional factors.
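The final association step can be sketched as follows. This is a simplified illustration under stated assumptions rather than the procedure of Liu et al. (2013): it fits the logistic model with a hand-written Newton-Raphson routine, uses a simulated ancestry covariate as a stand-in for PCs computed from data, and omits outlier removal and the cluster membership indicators. The function names fit_logistic and snp_wald_test are hypothetical.

```python
import math
import numpy as np


def fit_logistic(X, y, n_iter=25, ridge=1e-8):
    """Newton-Raphson logistic fit; returns coefficients and their covariance."""
    n_params = X.shape[1]
    beta = np.zeros(n_params)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        w = mu * (1.0 - mu)
        hessian = X.T @ (X * w[:, None]) + ridge * np.eye(n_params)
        beta = beta + np.linalg.solve(hessian, X.T @ (y - mu))
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    w = mu * (1.0 - mu)
    cov = np.linalg.inv(X.T @ (X * w[:, None]) + ridge * np.eye(n_params))
    return beta, cov


def snp_wald_test(snp, y, pcs=None):
    """Wald test for the SNP coefficient, optionally adjusting for PCs."""
    cols = [np.ones(len(y)), snp]
    if pcs is not None:
        cols.extend(pcs.T)
    X = np.column_stack(cols)
    beta, cov = fit_logistic(X, y)
    z2 = float(beta[1] ** 2 / cov[1, 1])
    p_value = math.erfc(math.sqrt(z2 / 2.0))  # chi-square (1 df) upper tail
    return beta[1], p_value


# Simulated data: ancestry drives both allele frequency and disease risk,
# while the SNP itself has no effect on disease.
rng = np.random.default_rng(1)
n = 2000
pop = rng.integers(0, 2, size=n)
snp = rng.binomial(2, np.where(pop == 1, 0.6, 0.2)).astype(float)
y = (rng.random(n) < np.where(pop == 1, 0.7, 0.3)).astype(float)
pcs = (pop - 0.5).reshape(-1, 1)  # stand-in for a PC that captures ancestry

_, p_unadjusted = snp_wald_test(snp, y)
_, p_adjusted = snp_wald_test(snp, y, pcs)
print("unadjusted p-value:", p_unadjusted)
print("PC-adjusted p-value:", p_adjusted)
```

Without the ancestry covariate the null SNP appears strongly associated with disease; once the covariate is included, the SNP coefficient is estimated conditional on ancestry and the spurious signal is removed.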
In conclusion, population stratification is a critical issue in genetic association studies, particularly GWAS. Methods such as PCA and Genomic Control are essential for detecting and correcting stratification, thereby ensuring the validity and reliability of study findings. The detailed methodology for implementing these techniques involves a combination of statistical and computational approaches, making them integral to modern genetic research.
References:
Freedman, M. L., Reich, D., Penney, K. L., McDonald, G. J., Mignault, A. A., Patterson, N., ... & Altshuler, D. (2004). Assessing the impact of population stratification on genetic association studies. Nature Genetics, 36(4), 388-393.
Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., & Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8), 904-909.
Liu, L., Zhang, D., Liu, H., & Arendt, C. (2013). Robust methods for population stratification in genome wide association studies. BMC Bioinformatics, 14(1), 1-12.