A Beginner's Dive into Gene Set Enrichment Analysis using Python
Creating a Gene Set Enrichment Analysis (GSEA) from scratch can be complex, especially for beginners. However, I can give you a simplified version of how you might approach this using Python, numpy, and built-in functions. This won't cover all the nuances of GSEA, but it can serve as a starting point.
import numpy as np
from scipy import stats
# Assume gene_expression is a 2D numpy array where rows are genes and columns are samples,
# and phenotype_labels is a 1D array where each element is 0 or 1, indicating the phenotype of each sample.
def simple_gsea(gene_expression, phenotype_labels):
num_genes, num_samples = gene_expression.shape
assert len(phenotype_labels) == num_samples
# Step 1: Rank genes based on their differential expression between phenotypes.
phenotype_0_indices = np.where(phenotype_labels == 0)[0]
phenotype_1_indices = np.where(phenotype_labels == 1)[0]
mean_diffs = gene_expression[:, phenotype_1_indices].mean(axis=1) - gene_expression[:, phenotype_0_indices].mean(axis=1)
ranks = np.argsort(mean_diffs)[::-1] # Sort in descending order
# Step 2: For each gene set, calculate the enrichment score (ES).
# Assume gene_sets is a list of sets, where each set contains the indices of genes in a particular gene set.
gene_sets = [set(np.random.choice(num_genes, size=50, replace=False)) for _ in range(10)] # Random example gene sets
enrichment_scores = []
for gene_set in gene_sets:
hits = np.isin(ranks, list(gene_set))
es = (hits.cumsum() / hits.sum() - np.arange(len(hits)) / len(hits)).max()
enrichment_scores.append(es)
return np.array(enrichment_scores)
# Example usage:
gene_expression = np.random.rand(1000, 50) # Random example data
phenotype_labels = np.random.randint(0, 2, size=50)
enrichment_scores = simple_gsea(gene_expression, phenotype_labels)
print(enrichment_scores)
In this code:
1. We first define a function simple_gsea that takes in a gene expression matrix and a phenotype label array.
2. Inside the function:
--We find the indices of samples for each phenotype.
--We calculate the mean difference in gene expression between the two phenotypes for each gene.
--We rank the genes based on this difference.
3. We then calculate the enrichment score for each gene set:
--For simplicity, we've generated some random gene sets as an example.
--For each gene set, we find the indices of the ranked genes that are in the gene set (hits).
--We calculate the enrichment score as the maximum deviation of the running sum of hits from a straight line.
This is a very simplified version of GSEA and lacks many features of a full GSEA analysis, such as permutation testing, normalization, and multiple testing correction. It's highly recommended to use established libraries like GSEApy for a more accurate and comprehensive analysis.