Loading icon

A Beginner's Dive into Gene Set Enrichment Analysis using Python

Post banner image
Share:

Creating a Gene Set Enrichment Analysis (GSEA) from scratch can be complex, especially for beginners. However, I can give you a simplified version of how you might approach this using Python, numpy, and built-in functions. This won't cover all the nuances of GSEA, but it can serve as a starting point.

  
    import numpy as np
    from scipy import stats
    
    # Assume gene_expression is a 2D numpy array where rows are genes and columns are samples,
    # and phenotype_labels is a 1D array where each element is 0 or 1, indicating the phenotype of each sample.
    def simple_gsea(gene_expression, phenotype_labels):
        num_genes, num_samples = gene_expression.shape
        assert len(phenotype_labels) == num_samples
    
        # Step 1: Rank genes based on their differential expression between phenotypes.
        phenotype_0_indices = np.where(phenotype_labels == 0)[0]
        phenotype_1_indices = np.where(phenotype_labels == 1)[0]
        mean_diffs = gene_expression[:, phenotype_1_indices].mean(axis=1) - gene_expression[:, phenotype_0_indices].mean(axis=1)
        ranks = np.argsort(mean_diffs)[::-1]  # Sort in descending order
    
        # Step 2: For each gene set, calculate the enrichment score (ES).
        # Assume gene_sets is a list of sets, where each set contains the indices of genes in a particular gene set.
        gene_sets = [set(np.random.choice(num_genes, size=50, replace=False)) for _ in range(10)]  # Random example gene sets
        enrichment_scores = []
        for gene_set in gene_sets:
            hits = np.isin(ranks, list(gene_set))
            es = (hits.cumsum() / hits.sum() - np.arange(len(hits)) / len(hits)).max()
            enrichment_scores.append(es)
    
        return np.array(enrichment_scores)
    
    # Example usage:
    gene_expression = np.random.rand(1000, 50)  # Random example data
    phenotype_labels = np.random.randint(0, 2, size=50)
    enrichment_scores = simple_gsea(gene_expression, phenotype_labels)
    print(enrichment_scores)
    

In this code:

1. We first define a function simple_gsea that takes in a gene expression matrix and a phenotype label array.
2. Inside the function:
--We find the indices of samples for each phenotype.
--We calculate the mean difference in gene expression between the two phenotypes for each gene.
--We rank the genes based on this difference.
3. We then calculate the enrichment score for each gene set:
--For simplicity, we've generated some random gene sets as an example.
--For each gene set, we find the indices of the ranked genes that are in the gene set (hits).
--We calculate the enrichment score as the maximum deviation of the running sum of hits from a straight line.

This is a very simplified version of GSEA and lacks many features of a full GSEA analysis, such as permutation testing, normalization, and multiple testing correction. It's highly recommended to use established libraries like GSEApy for a more accurate and comprehensive analysis.