Genetic Variant Association 1: A Pythonic Approach to Fisher Exact Test on VCF Files
In the challenging yet intriguing field of genetic research, associating genetic variants with traits is a crucial step that uncovers the secrets harbored in our DNA. Among the various tools available, the Fisher Exact Test is first implementations. Through this blog post, we will explore how these tests can be applied to VCF (Variant Call Format) files using Python. We will delve into a case-control cohort setup, utilizing libraries such as NumPy and gzip for data manipulation.
Understanding the Fisher Exact Test
The Fisher Exact Test is a statistical significance test used in the analysis of contingency tables. Although in practice it is employed in various fields, in genetics, it's often used to examine the association between genetic variants and phenotypic traits. The test is particularly suitable for small sample sizes.
Advantages:
Accurate for Small Samples: Unlike other tests, the Fisher Exact Test gives precise results even for small sample sizes.
No Assumption on Population Parameters: It does not assume anything about the population parameters, making it a non-parametric test.
Applicability to Sparse Data: It's applicable even when data is sparse or unevenly distributed.
Disadvantages:
Computational Intensity: The test can be computationally intensive for large datasets or contingency tables with many categories.
Two-Dimensional Limitation: Traditionally applies to 2x2 contingency tables, though extensions exist for larger tables.
Oversimplifies the genetic model by not accounting for other confounding variables.
Implementing Fisher Exact Test in Python
In Python, the fisher_exact function from the scipy.stats module can be used to perform the Fisher Exact Test. However, if you want to create a function from scratch, here's a simplified version:
from math import comb
def fisher_exact(table):
"""
Perform Fisher's Exact Test on a 2x2 contingency table.
Parameters:
table (list of list of int): 2x2 contingency table.
Returns:
float: p-value of the test.
"""
# Unpack the table values
a, b, c, d = table[0][0], table[0][1], table[1][0], table[1][1]
# Calculate the sum of rows and columns
row_sum_1, row_sum_2 = a + b, c + d
col_sum_1, col_sum_2 = a + c, b + d
total_sum = a + b + c + d
# Calculate the p-value
p_value = (comb(row_sum_1, a) * comb(row_sum_2, c)) / comb(total_sum, col_sum_1)
return p_value
In the fisher_exact function above, the comb function from the math module is used to compute the binomial coefficients. The table parameter is a 2x2 list representing the contingency table. The function unpacks the table values, computes the sums of rows and columns, and then calculates the p-value based on the formula for the Fisher Exact Test. This formula calculates the probability of observing such an arrangement of values in the contingency table, assuming the null hypothesis of no association between the rows and columns.
This simplified version is specifically for 2x2 tables and does not include the computation of the odds ratio or handling of larger contingency tables, which are features provided by the scipy.stats.fisher_exact function.
import gzip
import numpy as np
def convert_genotype(genotype):
mapping = {'0/0': 0, '0/1': 1, '1/1': 2}
return mapping[genotype]
# Load the VCF file
with gzip.open('data.vcf.gz', 'rt') as file:
data = [line.strip().split('\t') for line in file if not line.startswith('#')]
# Process genotype data and calculate zygosity for each variant
case_zygosity = []
control_zygosity = []
for variant in data:
case_genotypes = [convert_genotype(genotype.split(':')[0]) for idx, genotype in enumerate(variant[9:]) if idx % 2 == 0]
control_genotypes = [convert_genotype(genotype.split(':')[0]) for idx, genotype in enumerate(variant[9:]) if idx % 2 == 1]
case_zygosity.append(case_genotypes)
control_zygosity.append(control_genotypes)
case_zygosity = np.array(case_zygosity)
control_zygosity = np.array(control_zygosity)
# Sum zygosity for each variant in case and control groups
case_sum = np.sum(case_zygosity, axis=1)
control_sum = np.sum(control_zygosity, axis=1)
# Performing Fisher Exact Test for each variant
p_values = []
for i in range(len(case_sum)):
table = [[case_sum[i], control_sum[i]], [len(case_data[0]) - case_sum[i], len(control_data[0]) - control_sum[i]]]
p_value = fisher_exact(table)
p_values.append(p_value)
# Output the p-values
print(p_values)
In this script:
The VCF file is loaded and parsed to separate the header from the data.
For each variant, we iterate through the genotype data columns, starting from the 10th column (indexed as 9).
We use the enumerate function to get the index of each genotype data column. If the index is even, the genotype data belongs to a case; if it's odd, it belongs to a control.
A function convert_genotype is defined to convert genotype data from the VCF format to numerical values as specified (0 for 0/0, 1 for 0/1, and 2 for 1/1).
The genotype data for each variant is converted to numerical values, and the zygosity is calculated for the case and control groups.
The fisher_exact function computes the p-value for a 2x2 contingency table. The math.comb function is used to calculate binomial coefficients which are required to calculate the probabilities based on the hypergeometric distribution.
Conclusion
Through Python, we have navigated the avenues of performing genetic association tests on VCF files, showcasing the utility and limitations of the Fisher Exact Test and Gene-based Burden Test. These tools, with their distinct yet complementary capabilities, furnish researchers with a robust framework to delve deeper into the genetic architectures, particularly in familial settings demanding a more comprehensive analysis.